MSDS 7331 Lab One: Visualization and Data Preprocessing¶
Authors: Jaren Shead, Kristin Henderson, Tom Hines¶
1. Setup and Data Import¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import re
%matplotlib inline
df = pd.read_csv('data/diabetes+130-us+hospitals+for+years+1999-2008/diabetic_data.csv')
df.head()
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |
5 rows × 50 columns
df.shape
(101766, 50)
2. Business Understanding¶
Describe the purpose of the data set you selected (i.e. why was this data collected in the first place?).¶
The dataset comprises 101,766 hospital records of inpatients diagnosed with diabetes who had hospital stays of 1-14 days, collected from 130 U.S. hospitals between 1999 and 2008. While evidence highlights the benefits of preventative and therapeutic interventions in improving outcomes for diabetic patients, diabetes care during hospital stays often appears inconsistent. This dataset was collected to study early readmission rates within 30 days of discharge, aiming to enhance patient outcomes, reduce morbidity and mortality, and lower hospital management costs.
Source: Strack, B., et al. (2014). UCI Machine Learning Repository: Diabetes 130-US hospitals for years 1999–2008 Data Set. Retrieved from https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008
Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset?¶
The main goal of this dataset is to accurately classify hospital readmissions into three categories: within 30 days, after 30 days, or no readmission. This is important because high readmission rates lead to higher healthcare costs, worse patient outcomes, and added pressure on hospital resources. Identifying variables associated with readmission risk provides valuable information that providers can use to target treatments. Achieving good model performance with these variables will indicate that we have extracted useful insights from the data.
How would you measure the effectiveness of a good prediction algorithm?¶
A good prediction algorithm for this objective should prioritize Sensitivity, ensuring it correctly identifies patients at risk of early readmission. This is critical for enabling early interventions and preventative treatments. Once a predefined Sensitivity threshold is achieved (e.g. ~80%), the algorithm's Positive Predictive Value (PPV) should then be optimized to minimize false positives, ensuring resources are efficiently allocated and unnecessary treatments are avoided.
We will also split the data into training and validation sets, using cross-validation to evaluate the algorithm's performance on unseen data.
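As an illustrative sketch of how these two metrics could be computed for the early-readmission class, the snippet below uses scikit-learn's `recall_score` (Sensitivity) and `precision_score` (PPV) restricted to the `<30` label. The labels here are hypothetical toy predictions, not output from a fitted model.

```python
# Sketch: Sensitivity (recall) and PPV (precision) for the '<30' class,
# computed on hypothetical toy labels for illustration only.
from sklearn.metrics import recall_score, precision_score

y_true = ['<30', '>30', 'NO', '<30', 'NO', '>30', '<30', 'NO']
y_pred = ['<30', 'NO', 'NO', '<30', '<30', '>30', '>30', 'NO']

# Restrict both metrics to the early-readmission class of interest
sensitivity = recall_score(y_true, y_pred, labels=['<30'], average='macro')
ppv = precision_score(y_true, y_pred, labels=['<30'], average='macro')
print(f'Sensitivity: {sensitivity:.2f}, PPV: {ppv:.2f}')
```

The same calls would apply to real cross-validated predictions; `labels=['<30']` with `average='macro'` scores only the class we care about instead of averaging over all three.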
3. Data Understanding¶
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.¶
3.1 Data Attributes¶
- ID Variables
- encounter_id: Unique identifier of an encounter.
- patient_nbr: Unique identifier of a patient.
- Demographic Variables
- race: Patient's race (Categorical Levels: Caucasian, Asian, African American, Hispanic, other).
- gender: Patient's gender (Categorical Levels: male, female, unknown/invalid).
- age: Age grouped in 10-year intervals (Categorical Levels: [0-10), [10-20), ..., [90-100)).
- weight: Patient's weight in pounds (Categorical).
- Admission and Discharge Details
- admission_type_id: Type of admission (Categorical: 9 integer levels representing e.g., Emergency, Urgent, Elective).
- discharge_disposition_id: Type of discharge (Categorical: 29 integer levels representing e.g., Discharged to home, Expired).
- admission_source_id: Source of admission (Categorical: 21 integer levels representing e.g., Physician Referral, Emergency Room).
- Encounter Information
- time_in_hospital: Number of days in the hospital (Continuous).
- payer_code: Payer for medical services (Categorical: 18 unique levels, e.g., Blue Cross, Medicare).
- medical_specialty: Specialty of admitting physician (Categorical: 73 unique levels, e.g., Cardiology, Internal Medicine).
- Medical History and Diagnostics
- num_lab_procedures: Number of lab tests performed (Continuous).
- num_procedures: Number of procedures performed (excluding lab tests) (Continuous).
- num_medications: Number of distinct medications administered (Continuous).
- number_outpatient: Number of outpatient visits in the prior year (Continuous).
- number_emergency: Number of emergency visits in the prior year (Continuous).
- number_inpatient: Number of inpatient visits in the prior year (Continuous).
- diag_1, diag_2, diag_3: Primary and secondary diagnoses (ICD-9 codes) (Categorical: integer values representing 848, 923, and 954 levels, respectively).
- number_diagnoses: Total number of diagnoses recorded (Continuous).
- Lab Results
- max_glu_serum: Glucose serum test results (Categorical Levels: >200, >300, normal, none).
- A1Cresult: Hemoglobin A1c test results (Categorical Levels: >8, >7, normal, none).
- Medications
- Variables for medications: Indicate whether prescribed and dosage changes (Categorical Levels: up, down, steady, no).
- Drug Classes:
- Biguanides reduce glucose production in the liver and improve insulin sensitivity.
- metformin: a first-line treatment for type 2 diabetes
- Meglitinides stimulate insulin secretion but act faster and have a shorter duration compared to sulfonylureas.
- repaglinide
- nateglinide
- Sulfonylureas 1st Generation stimulate the pancreas to produce more insulin. These are older drugs with lower potency and a higher risk of side effects, such as hypoglycemia.
- chlorpropamide
- acetohexamide
- tolbutamide
- tolazamide
- Sulfonylureas 2nd Generation stimulate the pancreas to produce more insulin. These are newer, more potent drugs with shorter half-lives, reducing the risk of hypoglycemia.
- glimepiride
- glipizide
- glyburide
- Thiazolidinediones (TZDs) improve insulin sensitivity in muscle and fat tissue. They can have serious side effects, including weight gain and fluid retention. (They have been withdrawn in many markets due to evidence of liver toxicity.)
- pioglitazone
- rosiglitazone
- troglitazone
- Alpha-Glucosidase Inhibitors delay carbohydrate digestion and absorption in the intestines, reducing post-meal blood sugar spikes.
- acarbose
- miglitol
- DPP-4 Inhibitors enhance incretin hormone levels to increase insulin release and decrease glucagon.
- citoglipton: An unknown drug and potential error in the dataset; it may be a misspelling of sitagliptin.
- Insulin and Combinations: Combination drugs combine two medications to simplify treatment and target multiple mechanisms for blood sugar control.
- insulin directly supplements or replaces natural insulin in the body, essential for type 1 and advanced type 2 diabetes management.
- glyburide-metformin
- glipizide-metformin
- glimepiride-pioglitazone
- metformin-rosiglitazone
- metformin-pioglitazone
- Non-Diabetes Medications
- examide is a diuretic primarily prescribed for cardiovascular or renal issues, not blood sugar control. (Like citoglipton, the name does not match a known drug and may be a data-entry error.)
- Medication Outcome Variables
- change: Indicates if diabetic medications were changed during the encounter (Categorical Levels: change, no change).
- diabetesMed: Indicates if diabetic medications were prescribed (Categorical Levels: yes, no).
- Target Variable
- readmitted: Readmission status (Target Classes: <30, >30, No).
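Several of the attributes above (age, weight) are ordinal bins stored as strings, so plain string comparison would mis-order them. A minimal sketch of treating age as an ordered categorical, assuming the bracket labels listed above:

```python
import pandas as pd

# Sketch: encode the age brackets as an ordered categorical so that
# comparisons and sorting respect the natural bin order.
age_levels = ['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)',
              '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)']
ages = pd.Series(['[70-80)', '[0-10)', '[40-50)'])
ages = ages.astype(pd.CategoricalDtype(categories=age_levels, ordered=True))
print(ages.min())  # '[0-10)' — ordering follows the bin, not the string
```

The same pattern would apply to the weight brackets if that column is retained.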
Metadata Categories¶
- admission_type_id
- 1: Emergency; 2: Urgent; 3: Elective; 4: Newborn; 5: Not Available; 6: NULL; 7: Trauma Center; 8: Not Mapped
- discharge_disposition_id
- 1: Discharged to home; 2: Discharged/transferred to another short-term hospital; 3: Discharged/transferred to SNF; 4: Discharged/transferred to ICF; 5: Discharged/transferred to another type of inpatient care institution; 6: Discharged/transferred to home with home health service; 7: Left AMA; 8: Discharged/transferred to home under care of Home IV provider; 9: Admitted as an inpatient to this hospital; 10: Neonate discharged to another hospital for neonatal aftercare; 11: Expired; 12: Still patient or expected to return for outpatient services; 13: Hospice / home; 14: Hospice / medical facility; 15: Discharged/transferred within this institution to Medicare approved swing bed; 16: Discharged/transferred/referred to another institution for outpatient services; 17: Discharged/transferred/referred to this institution for outpatient services; 18: NULL; 19: Expired at home (Medicaid only, hospice); 20: Expired in a medical facility (Medicaid only, hospice); 21: Expired, place unknown (Medicaid only, hospice); 22: Discharged/transferred to another rehab facility including rehab units of a hospital; 23: Discharged/transferred to a long-term care hospital; 24: Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare; 25: Not Mapped; 26: Unknown/Invalid; 27: Discharged/transferred to a federal health care facility; 28: Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital; 29: Discharged/transferred to a Critical Access Hospital (CAH); 30: Discharged/transferred to another type of health care institution not defined elsewhere
- admission_source_id
- 1: Physician Referral; 2: Clinic Referral; 3: HMO Referral; 4: Transfer from a hospital; 5: Transfer from a Skilled Nursing Facility (SNF); 6: Transfer from another health care facility; 7: Emergency Room; 8: Court/Law Enforcement; 9: Not Available; 10: Transfer from critical access hospital; 11: Normal Delivery; 12: Premature Delivery; 13: Sick Baby; 14: Extramural Birth; 15: Not Available; 17: NULL; 18: Transfer from Another Home Health Agency; 19: Readmission to Same Home Health Agency; 20: Not Mapped; 21: Unknown/Invalid; 22: Transfer from hospital inpatient/same facility resulting in a separate claim; 23: Born inside this hospital; 24: Born outside this hospital; 25: Transfer from Ambulatory Surgery Center; 26: Transfer from Hospice
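The integer-coded administrative columns can be decoded with a simple dictionary lookup built from the mapping above. A sketch for `admission_type_id` (only this one mapping is shown; the other two would follow the same pattern):

```python
import pandas as pd

# Sketch: decode admission_type_id codes using the IDS mapping above.
admission_type_map = {
    1: 'Emergency', 2: 'Urgent', 3: 'Elective', 4: 'Newborn',
    5: 'Not Available', 6: 'NULL', 7: 'Trauma Center', 8: 'Not Mapped',
}
codes = pd.Series([1, 3, 6, 7])
labels = codes.map(admission_type_map)
print(labels.tolist())  # ['Emergency', 'Elective', 'NULL', 'Trauma Center']
```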
# # Display the variable metadata in a more readable table format
# metadata = pd.read_csv('data/diabetes+130-us+hospitals+for+years+1999-2008/IDS_mapping.csv')
# print(metadata.to_markdown(index=False))
A. Missing Data Summary Statistics and Visualization¶
This identifies columns with NaNs.
# Find the NaNs
nan_sums = df.isna().sum()
nan_sums_nonzero = nan_sums[nan_sums != 0]
nan_sums_nonzero
# Calculate percentages
nan_percentages = (nan_sums_nonzero / len(df)) * 100
# Create a DataFrame of NaN counts and percentages
nan_summary = pd.DataFrame({
'NaN Count': nan_sums_nonzero,
'NaN Percentage': nan_percentages
}).sort_values(by='NaN Count', ascending=False)
nan_summary
| NaN Count | NaN Percentage | |
|---|---|---|
| max_glu_serum | 96420 | 94.746772 |
| A1Cresult | 84748 | 83.277322 |
This identifies columns with '?' values.
# Find `?` values and decide how to deal with them
questionmark_sums = (df == '?').sum()
questionmark_sums_nonzero = questionmark_sums[questionmark_sums != 0]
questionmark_sums_nonzero
# Calculate percentages
questionmark_percentages = (questionmark_sums_nonzero / len(df)) * 100
# Create a DataFrame of NaN counts and percentages
questionmark_summary = pd.DataFrame({
'? Count': questionmark_sums_nonzero,
'? Percentage': questionmark_percentages
}).sort_values(by='? Count', ascending=False)
questionmark_summary
| ? Count | ? Percentage | |
|---|---|---|
| weight | 98569 | 96.858479 |
| medical_specialty | 49949 | 49.082208 |
| payer_code | 40256 | 39.557416 |
| race | 2273 | 2.233555 |
| diag_3 | 1423 | 1.398306 |
| diag_2 | 358 | 0.351787 |
| diag_1 | 21 | 0.020636 |
# Replace `?` with NaN for now
df_clean = df.replace('?', np.nan)  # replace() already returns a copy
# How many missing values are in each weight class?
# Define the order for weight classes
weight_order = [
"[0-25)", "[25-50)", "[50-75)", "[75-100)", "[100-125)",
"[125-150)", "[150-175)", "[175-200)", ">200", np.nan
]
weight_counts = df_clean['weight'].value_counts(dropna=False)
# Reindex the counts based on the custom order
sorted_weight_counts = weight_counts.reindex(weight_order)
sorted_weight_counts
weight
[0-25)         48
[25-50)        97
[50-75)       897
[75-100)     1336
[100-125)     625
[125-150)     145
[150-175)      35
[175-200)      11
>200            3
NaN         98569
Name: count, dtype: int64
# Let's look at age. The low weights will likely be young patients.
df_clean['age'].value_counts(dropna=False)
age
[70-80)     26068
[60-70)     22483
[50-60)     17256
[80-90)     17197
[40-50)      9685
[30-40)      3775
[90-100)     2793
[20-30)      1657
[10-20)       691
[0-10)        161
Name: count, dtype: int64
under_50_lbs = ((df_clean['weight'] == '[0-25)') | (df_clean['weight'] == '[25-50)')).sum()
under_10_yrs = (df_clean['age'] == '[0-10)').sum()
print(f'There are {under_50_lbs} patients under 50 lbs and {under_10_yrs} patients under 10 years old.')
There are 145 patients under 50 lbs and 161 patients under 10 years old.
# Let's look closer at max_glu_serum representing glucose serum test results
# One of the categorical levels is `none`. This test is not conducted on most patients.
df_clean['max_glu_serum'].value_counts(dropna=False)
max_glu_serum
NaN     96420
Norm     2597
>200     1485
>300     1264
Name: count, dtype: int64
# Let's look at A1Cresult
# One of the categorical levels is `none`. This test is not conducted on most patients.
df_clean['A1Cresult'].value_counts(dropna=False)
A1Cresult
NaN     84748
>8       8216
Norm     4990
>7       3812
Name: count, dtype: int64
print(f"There are {df_clean['medical_specialty'].nunique()} unique categories in the `medical_specialty` variable.")
med_specialty_categories = df_clean['medical_specialty'].unique()
med_specialty_categories
There are 72 unique categories in the `medical_specialty` variable.
array(['Pediatrics-Endocrinology', nan, 'InternalMedicine',
'Family/GeneralPractice', 'Cardiology', 'Surgery-General',
'Orthopedics', 'Gastroenterology',
'Surgery-Cardiovascular/Thoracic', 'Nephrology',
'Orthopedics-Reconstructive', 'Psychiatry', 'Emergency/Trauma',
'Pulmonology', 'Surgery-Neuro',
'Obsterics&Gynecology-GynecologicOnco', 'ObstetricsandGynecology',
'Pediatrics', 'Hematology/Oncology', 'Otolaryngology',
'Surgery-Colon&Rectal', 'Pediatrics-CriticalCare', 'Endocrinology',
'Urology', 'Psychiatry-Child/Adolescent', 'Pediatrics-Pulmonology',
'Neurology', 'Anesthesiology-Pediatric', 'Radiology',
'Pediatrics-Hematology-Oncology', 'Psychology', 'Podiatry',
'Gynecology', 'Oncology', 'Pediatrics-Neurology',
'Surgery-Plastic', 'Surgery-Thoracic',
'Surgery-PlasticwithinHeadandNeck', 'Ophthalmology',
'Surgery-Pediatric', 'Pediatrics-EmergencyMedicine',
'PhysicalMedicineandRehabilitation', 'InfectiousDiseases',
'Anesthesiology', 'Rheumatology', 'AllergyandImmunology',
'Surgery-Maxillofacial', 'Pediatrics-InfectiousDiseases',
'Pediatrics-AllergyandImmunology', 'Dentistry', 'Surgeon',
'Surgery-Vascular', 'Osteopath', 'Psychiatry-Addictive',
'Surgery-Cardiovascular', 'PhysicianNotFound', 'Hematology',
'Proctology', 'Obstetrics', 'SurgicalSpecialty', 'Radiologist',
'Pathology', 'Dermatology', 'SportsMedicine', 'Speech',
'Hospitalist', 'OutreachServices', 'Cardiology-Pediatric',
'Perinatology', 'Neurophysiology', 'Endocrinology-Metabolism',
'DCPTEAM', 'Resident'], dtype=object)
spec_counts = df_clean['medical_specialty'].value_counts(dropna=False)
value_counts = spec_counts[spec_counts > 1000]
percentages = (value_counts / len(df_clean['medical_specialty'])) * 100
med_specialty_df = pd.DataFrame({
'Category': value_counts.index,
'Count': value_counts.values,
'Percentage (%)': percentages.values,
})
med_specialty_df
| Category | Count | Percentage (%) | |
|---|---|---|---|
| 0 | NaN | 49949 | 49.082208 |
| 1 | InternalMedicine | 14635 | 14.381031 |
| 2 | Emergency/Trauma | 7565 | 7.433720 |
| 3 | Family/GeneralPractice | 7440 | 7.310890 |
| 4 | Cardiology | 5352 | 5.259124 |
| 5 | Surgery-General | 3099 | 3.045221 |
| 6 | Nephrology | 1613 | 1.585009 |
| 7 | Orthopedics | 1400 | 1.375705 |
| 8 | Orthopedics-Reconstructive | 1233 | 1.211603 |
| 9 | Radiologist | 1140 | 1.120217 |
value_counts = df_clean['payer_code'].value_counts(dropna=False)
percentages = (value_counts / len(df_clean['payer_code'])) * 100
payer_df = pd.DataFrame({
'Category': value_counts.index,
'Count': value_counts.values,
'Percentage (%)': percentages.values,
})
payer_df
| Category | Count | Percentage (%) | |
|---|---|---|---|
| 0 | NaN | 40256 | 39.557416 |
| 1 | MC | 32439 | 31.876069 |
| 2 | HM | 6274 | 6.165124 |
| 3 | SP | 5007 | 4.920111 |
| 4 | BC | 4655 | 4.574219 |
| 5 | MD | 3532 | 3.470707 |
| 6 | CP | 2533 | 2.489043 |
| 7 | UN | 2448 | 2.405519 |
| 8 | CM | 1937 | 1.903386 |
| 9 | OG | 1033 | 1.015074 |
| 10 | PO | 592 | 0.581727 |
| 11 | DM | 549 | 0.539473 |
| 12 | CH | 146 | 0.143466 |
| 13 | WC | 135 | 0.132657 |
| 14 | OT | 95 | 0.093351 |
| 15 | MP | 79 | 0.077629 |
| 16 | SI | 55 | 0.054046 |
| 17 | FR | 1 | 0.000983 |
value_counts = df_clean['race'].value_counts(dropna=False)
percentages = (value_counts / len(df_clean['race'])) * 100
race_df = pd.DataFrame({
'Category': value_counts.index,
'Count': value_counts.values,
'Percentage (%)': percentages.values,
})
race_df
| Category | Count | Percentage (%) | |
|---|---|---|---|
| 0 | Caucasian | 76099 | 74.778413 |
| 1 | AfricanAmerican | 19210 | 18.876639 |
| 2 | NaN | 2273 | 2.233555 |
| 3 | Hispanic | 2037 | 2.001651 |
| 4 | Other | 1506 | 1.479866 |
| 5 | Asian | 641 | 0.629876 |
It seems reasonable that not every patient would have 3 diagnoses. Only 21 are missing diag_1. Are those also missing diag_2 and diag_3?
# Get the indexes where 'diag_1' was '?' (now NaN)
indexes_with_diag_1_questionable = df_clean[df_clean['diag_1'].isna()].index.tolist()
# Look at the rows where diag_1 is '?'
rows_with_diag_1_questionable = df_clean.loc[indexes_with_diag_1_questionable, ['diag_1', 'diag_2', 'diag_3']]
rows_with_diag_1_questionable
| diag_1 | diag_2 | diag_3 | |
|---|---|---|---|
| 518 | NaN | 780 | 997 |
| 1006 | NaN | 595 | 250.6 |
| 1267 | NaN | 250.82 | 401 |
| 1488 | NaN | 276 | 594 |
| 3197 | NaN | 250.01 | 428 |
| 7845 | NaN | 496 | 788 |
| 14503 | NaN | 276 | 250.01 |
| 19714 | NaN | 112 | 585 |
| 32514 | NaN | 998 | 427 |
| 37693 | NaN | 780 | 295 |
| 49516 | NaN | 707 | 427 |
| 56931 | NaN | 585 | 427 |
| 57058 | NaN | V63 | 414 |
| 57737 | NaN | 276 | V08 |
| 60314 | NaN | 427 | 486 |
| 61369 | NaN | 780 | 780 |
| 61763 | NaN | 426 | 427 |
| 66848 | NaN | 780 | 427 |
| 86018 | NaN | 250.02 | 438 |
| 87181 | NaN | NaN | NaN |
| 98396 | NaN | 780 | 250 |
diag3_counts = df_clean['diag_3'].value_counts(dropna=False)
value_counts = diag3_counts[diag3_counts > 1400]
percentages = (value_counts / len(df_clean['diag_3'])) * 100
diag3_df = pd.DataFrame({
'Category': value_counts.index,
'Count': value_counts.values,
'Percentage (%)': percentages.values,
})
diag3_df
| Category | Count | Percentage (%) | |
|---|---|---|---|
| 0 | 250 | 11555 | 11.354480 |
| 1 | 401 | 8289 | 8.145157 |
| 2 | 276 | 5175 | 5.085195 |
| 3 | 428 | 4577 | 4.497573 |
| 4 | 427 | 3955 | 3.886367 |
| 5 | 414 | 3664 | 3.600417 |
| 6 | 496 | 2605 | 2.559794 |
| 7 | 403 | 2357 | 2.316098 |
| 8 | 585 | 1992 | 1.957432 |
| 9 | 272 | 1969 | 1.934831 |
| 10 | 599 | 1941 | 1.907317 |
| 11 | NaN | 1423 | 1.398306 |
admin_variables = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']
null_values = [6, 18, 17]
admin_null_df = pd.DataFrame({
'Variable': pd.Series(dtype='str'),
'Null_Value': pd.Series(dtype='int'),
'Count': pd.Series(dtype='int'),
'Percentage (%)': pd.Series(dtype='float')
})
for var, null_val in zip(admin_variables, null_values):
# Get value counts for the current variable
value_counts = df_clean[var].value_counts(dropna=False)
# Extract the count and calculate percentage for the specific "null" value
count = value_counts.get(null_val, 0) # Use .get() to safely retrieve the count
percentage = (count / len(df_clean[var])) * 100
# Create a temporary DataFrame for this variable
temp_df = pd.DataFrame({
'Variable': [var],
'Null_Value': [null_val],
'Count': [count],
'Percentage (%)': [percentage]
})
# Concatenate the temporary DataFrame with the main one
admin_null_df = pd.concat([admin_null_df, temp_df], ignore_index=True)
admin_null_df
| Variable | Null_Value | Count | Percentage (%) | |
|---|---|---|---|---|
| 0 | admission_type_id | 6 | 5291 | 5.199182 |
| 1 | discharge_disposition_id | 18 | 3691 | 3.626948 |
| 2 | admission_source_id | 17 | 6781 | 6.663326 |
Additionally, we need to be wary of any columns with zero variance. Columns that share the same value for every observation offer no explanatory insights for us to work with.
# Identify zero variance columns
zero_variance_cols = [col for col in df_clean.columns if df_clean[col].nunique() <= 1]
zero_variance_df = df_clean.loc[:, zero_variance_cols]
zero_variance_df
| examide | citoglipton | |
|---|---|---|
| 0 | No | No |
| 1 | No | No |
| 2 | No | No |
| 3 | No | No |
| 4 | No | No |
| ... | ... | ... |
| 101761 | No | No |
| 101762 | No | No |
| 101763 | No | No |
| 101764 | No | No |
| 101765 | No | No |
101766 rows × 2 columns
Variables with missing values (either NaN or ?) include weight, max_glu_serum, A1Cresult, medical_specialty, payer_code, race, diag_3, diag_2, and diag_1. There are also integer levels representing NULL in admission_type_id, discharge_disposition_id, and admission_source_id. Finally, there are two columns with zero variance: examide and citoglipton.
Strategies for dealing with missing data:¶
- It seems that for features with a high percentage of missing values (e.g. weight, max_glu_serum, A1Cresult), we should remove these features from our analysis.
- However, is there a way to examine the rows with these values to determine their predictive power? It may be that prediction can be improved significantly with these. Part of the conclusion and recommendation may be for hospitals to include these tests for all diabetes patients.
- Check correlations between these variables and the target variable for rows where data is available.
- Use a random forest model to evaluate feature importance to see if these features contribute significantly to prediction when included.
- Investigate differences in the target variable for patients with and without these values.
- Consider that these may test for long-term markers rather than parameters critical for acute care. They may also be done at routine doctor's appointments and avoided during acute treatment to prevent duplication.
- It would be interesting to investigate if a certain payer code or medical specialty for which we have records are associated with lower rates of treatment or higher rates of re-admission.
- Use bar plots or heat maps to look into how payer codes and medical specialties are distributed across patients with different outcomes (e.g. re-admission rates).
- Certain payer codes or specialties may correlate with lower rates of tests or treatments, which could indicate disparities in care.
- Consider if it is better to recode a '?', the value in the original dataset, with NaN or a different category like 'Unknown'.
- For features with < 3% missing values (e.g. race, diag_3, diag_2, diag_1), imputation seems like a viable option, either using the mode or something more sophisticated like a multi-class classification technique.
- It is also plausible that some patients only have one or two diagnoses so that is why diag_2 and diag_3 are missing.
- There are also levels coded as integers representing NULL in admission_type_id (6), discharge_disposition_id (18), and admission_source_id (17). It is unclear what this means.
- Regarding the zero variance columns, there's not much we can do except drop these columns and continue the analysis without them.
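One of the ideas above — checking whether the *presence* of a test result is itself predictive — can be sketched by comparing early-readmission rates between tested and untested patients. The snippet below uses a tiny synthetic frame with the dataset's column names for illustration; the real analysis would run the same `groupby` on `df_clean`.

```python
import pandas as pd

# Sketch (synthetic data): compare '<30' readmission rates for patients
# with vs. without an A1C result, to probe whether missingness is informative.
toy = pd.DataFrame({
    'A1Cresult': ['>8', None, 'Norm', None, '>7', None],
    'readmitted': ['<30', 'NO', 'NO', '<30', 'NO', '<30'],
})
toy['a1c_tested'] = toy['A1Cresult'].notna()
rates = toy.groupby('a1c_tested')['readmitted'].apply(lambda s: (s == '<30').mean())
print(rates)
```

A large gap between the two group rates would suggest encoding "tested vs. untested" as its own feature rather than dropping the column outright.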
import missingno as msno
# Visualize missing data as a heatmap
msno.heatmap(df_clean, fontsize=10, figsize=(8, 6))
# plt.savefig(f'plots/missing_heatmap.png')
plt.show()
The heatmap displays correlations in the patterns of missingness between variables. A positive correlation of 0.4 between diag_2 and diag_3 indicates that if a record is missing a value for diag_2, there is an increased likelihood that it is also missing a value for diag_3. This relationship suggests that if diag_2 is missing, we may be able to use diag_3 (if available) to inform our imputation strategy.
The lack of strong correlations between most other variables suggests their missingness is likely independent of one another. However, negative correlations (e.g., -0.1 between medical_specialty and max_glu_serum) indicate a weak inverse relationship, meaning that when one column has a missing value, the other is slightly less likely to be missing.
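The heatmap's values are simply pairwise correlations between boolean missingness indicators, which can be reproduced directly with pandas. A minimal sketch on a toy frame (column names borrowed from the dataset for illustration):

```python
import numpy as np
import pandas as pd

# Sketch: msno.heatmap values are the correlation of isna() indicators.
toy = pd.DataFrame({
    'diag_2': [1.0, np.nan, 3.0, np.nan],
    'diag_3': [np.nan, np.nan, 1.0, np.nan],
})
miss_corr = toy.isna().astype(int).corr()
print(miss_corr.loc['diag_2', 'diag_3'])
```

Computing the matrix directly like this makes it easy to filter for pairs above a chosen threshold instead of reading values off the plot.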
# Visualize missing data as a matrix
msno.matrix(df_clean.loc[:, df_clean.isnull().any()])
# plt.savefig(f'plots/missing_matrix.png')
plt.show()
This matrix plot displays records across the full dataset, with missing values shown in white. There doesn’t appear to be a strong indication that specific rows are disproportionately affected by missingness. Patterns of missing values seem fairly distributed across rows without obvious clustering. This suggests that while variable-level missingness is significant, row-level missingness may not require targeted removal or specific handling.
# Visualize dendrogram for missingness correlations
msno.dendrogram(df_clean.loc[:, df_clean.isnull().any()])
# plt.savefig(f'plots/missing_dendrogram.png')
plt.show()
The dendrogram provides insights into relationships between variables based on their missingness patterns. Diagnosis codes (diag_1, diag_2, diag_3) cluster together, indicating shared trends in missing values, which may reflect relationships in the data collection process. Variables like max_glu_serum and weight also cluster, though this may be less meaningful due to the high proportion of missing data in these columns.
B. Decisions on handling the missing data for further processing¶
There isn't an Unknown category in medical_specialty, payer_code, or race, so it seems reasonable to recode the NaN ('?') values as an Unknown category rather than leaving them as NaN. It also seems plausible that not every patient has more than one diagnosis code, so converting NaN ('?') to Unknown/None seems reasonable for the diagnosis codes. The dataset documentation describes a 'none' category for both max_glu_serum and A1Cresult. Assuming the NaN values correspond to this 'none' category, which likely represents patients who did not receive the test, NaN values in these columns were replaced with Untested.
# Replace NaN ('?') with 'Unknown' in the specified columns
columns_to_update_1 = ['medical_specialty', 'payer_code', 'race']
df_clean[columns_to_update_1] = df_clean[columns_to_update_1].replace(np.nan, 'Unknown')
# Replace NaN ('?') with 'Unknown/None' in the diagnosis columns
columns_to_update_2 = ['diag_1', 'diag_2', 'diag_3']
df_clean[columns_to_update_2] = df_clean[columns_to_update_2].replace(np.nan, 'Unknown/None')
# Replace NaN with 'Untested' in the specified columns
columns_to_update_3 = ['max_glu_serum', 'A1Cresult']
df_clean[columns_to_update_3] = df_clean[columns_to_update_3].replace(np.nan, 'Untested')
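After these replacements, only `weight` should still contain NaN, since it has not been handled yet. The replace-then-verify pattern can be sketched on a toy frame (synthetic values; column names follow the dataset):

```python
import numpy as np
import pandas as pd

# Sketch (toy data): apply the replacements above, then confirm that only
# deliberately untouched columns still contain NaN.
toy = pd.DataFrame({
    'race': ['Caucasian', np.nan],
    'A1Cresult': [np.nan, '>8'],
    'weight': [np.nan, '[75-100)'],
})
toy['race'] = toy['race'].replace(np.nan, 'Unknown')
toy['A1Cresult'] = toy['A1Cresult'].replace(np.nan, 'Untested')
remaining = toy.isna().sum()
print(remaining[remaining > 0].index.tolist())  # ['weight']
```

Running the equivalent check on `df_clean` guards against silently missing a column in the update lists.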
3.2.2 Duplicates¶
duplicate_count = df_clean.duplicated().sum()
print(f'The dataset has {duplicate_count} duplicate records.')
The dataset has 0 duplicate records.
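Whole-row duplicates are absent, but the same patient can still appear in multiple encounters, which matters for independence assumptions when splitting the data. A sketch of the patient-level check on toy data (the real check would use `df_clean['patient_nbr']`):

```python
import pandas as pd

# Sketch (toy data): count encounters belonging to previously seen patients.
toy = pd.DataFrame({
    'patient_nbr': [111, 222, 111, 333],
    'time_in_hospital': [3, 1, 5, 2],
})
repeat_patients = toy['patient_nbr'].duplicated().sum()
print(f'{repeat_patients} encounters belong to previously seen patients.')
```

If repeats are common, a grouped split (e.g. keeping all of a patient's encounters in the same fold) may be worth considering.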
3.2.3 Remove ID variable: 'encounter_id'¶
# Verify that encounter_id is unique for each record.
df_clean['encounter_id'].nunique() == len(df_clean)
True
if 'encounter_id' in df_clean:
del df_clean['encounter_id']
# if 'patient_nbr' in df_clean:
# del df_clean['patient_nbr']
print( df_clean.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 49 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   patient_nbr               101766 non-null  int64
 1   race                      101766 non-null  object
 2   gender                    101766 non-null  object
 3   age                       101766 non-null  object
 4   weight                    3197 non-null    object
 5   admission_type_id         101766 non-null  int64
 6   discharge_disposition_id  101766 non-null  int64
 7   admission_source_id       101766 non-null  int64
 8   time_in_hospital          101766 non-null  int64
 9   payer_code                101766 non-null  object
 10  medical_specialty         101766 non-null  object
 11  num_lab_procedures        101766 non-null  int64
 12  num_procedures            101766 non-null  int64
 13  num_medications           101766 non-null  int64
 14  number_outpatient         101766 non-null  int64
 15  number_emergency          101766 non-null  int64
 16  number_inpatient          101766 non-null  int64
 17  diag_1                    101766 non-null  object
 18  diag_2                    101766 non-null  object
 19  diag_3                    101766 non-null  object
 20  number_diagnoses          101766 non-null  int64
 21  max_glu_serum             101766 non-null  object
 22  A1Cresult                 101766 non-null  object
 23  metformin                 101766 non-null  object
 24  repaglinide               101766 non-null  object
 25  nateglinide               101766 non-null  object
 26  chlorpropamide            101766 non-null  object
 27  glimepiride               101766 non-null  object
 28  acetohexamide             101766 non-null  object
 29  glipizide                 101766 non-null  object
 30  glyburide                 101766 non-null  object
 31  tolbutamide               101766 non-null  object
 32  pioglitazone              101766 non-null  object
 33  rosiglitazone             101766 non-null  object
 34  acarbose                  101766 non-null  object
 35  miglitol                  101766 non-null  object
 36  troglitazone              101766 non-null  object
 37  tolazamide                101766 non-null  object
 38  examide                   101766 non-null  object
 39  citoglipton               101766 non-null  object
 40  insulin                   101766 non-null  object
 41  glyburide-metformin       101766 non-null  object
 42  glipizide-metformin       101766 non-null  object
 43  glimepiride-pioglitazone  101766 non-null  object
 44  metformin-rosiglitazone   101766 non-null  object
 45  metformin-pioglitazone    101766 non-null  object
 46  change                    101766 non-null  object
 47  diabetesMed               101766 non-null  object
 48  readmitted                101766 non-null  object
dtypes: int64(12), object(37)
memory usage: 38.0+ MB
None
# Convert the categorical variables `patient_nbr`, `admission_type_id`, `discharge_disposition_id`,
# and `admission_source_id` from integer to the category datatype.
categoricalInt_cols = ['patient_nbr', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id']
df_clean[categoricalInt_cols] = df_clean[categoricalInt_cols].astype('category')
df_clean[['patient_nbr', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id']].dtypes
patient_nbr                 category
admission_type_id           category
discharge_disposition_id    category
admission_source_id         category
dtype: object
3.3 Statistical summaries of the features¶
Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful.¶
3.3.1 Numerical features¶
Statistical Summaries of Numerical Attributes:
- Count: The raw count of observations for a particular attribute in the dataset.
- Mean: The average value of the observations for a particular attribute.
- Std: The standard deviation, measuring the spread of values among observations.
- Variance: The average of the squared differences from the mean, showing variability.
- Min: The smallest value among the observations for a particular attribute.
- 25%: The first quartile, representing the value below which 25% of the data falls.
- Median: The middle value when observations are sorted, dividing the dataset into two equal halves.
- 75%: The third quartile, representing the value below which 75% of the data falls.
- Max: The largest value among the observations for a particular attribute.
- Mode: The most commonly recurring value among observations for a particular attribute.
- Range: The difference between the maximum and minimum values, showing the span of the data.
- Kurtosis: A measure of how heavily the tails of the distribution differ from those of a normal distribution, showing the extent of the outliers. Note that pandas' kurt() reports excess kurtosis, so a normal distribution scores approximately 0:
- Kurtosis ~= 0, Tails are neither heavy nor light
- Kurtosis > 0, Heavy tails and a sharper peak
- Kurtosis < 0, Lighter tails and a flatter peak
- Skew: A measure of asymmetry in the distribution of the data.
- Skew < 0 indicates Negative/Left-Skew
- Skew > 0 indicates Positive/Right-Skew
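Since the interpretations below depend on pandas' conventions, here is a small sanity check on synthetic data (not part of the diabetes dataset) confirming that `skew()` measures asymmetry and that `kurt()` reports excess kurtosis, i.e. roughly 0 for normally distributed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "normal": rng.normal(size=10_000),             # symmetric reference
    "right_skewed": rng.exponential(size=10_000),  # theoretical skew = 2
})

# pandas reports *excess* kurtosis (Fisher): ~0 for normal data,
# ~6 for an exponential distribution
print(demo.skew())
print(demo.kurt())
```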
# Summary statistics of the numerical features
# Calculate basic statistics
desc = df_clean.describe() # Includes count, mean, std, 5-number summary
modes_num = df_clean.mode(numeric_only=True).iloc[0] # Get the first mode for each column
variances = df_clean.var(numeric_only=True)
ranges = df_clean.max(numeric_only=True) - df_clean.min(numeric_only=True)
kurt = df_clean.kurt(numeric_only=True)
skew = df_clean.skew(numeric_only=True)
# Create a summary DataFrame
num_stats_df = pd.DataFrame({
'Count': desc.loc['count'],
'Mean': desc.loc['mean'],
'Std': desc.loc['std'],
'Min': desc.loc['min'],
'25%': desc.loc['25%'],
'Median': desc.loc['50%'],
'75%': desc.loc['75%'],
'Max': desc.loc['max'],
'Range': ranges,
'Variance': variances,
'Mode': modes_num,
'Kurtosis': kurt,
'Skew': skew
})
# Rearrange and clean up for better presentation
num_stats_df = num_stats_df[['Count', 'Mean', 'Std', 'Variance', 'Min', '25%', 'Median', '75%', 'Max', 'Range', 'Mode', 'Kurtosis', 'Skew']]
num_stats_df.reset_index(inplace=True) # Optional: Keep variable names as a column
num_stats_df.rename(columns={'index': 'Variable'}, inplace=True)
# Display the final DataFrame
num_stats_df
| Variable | Count | Mean | Std | Variance | Min | 25% | Median | 75% | Max | Range | Mode | Kurtosis | Skew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | time_in_hospital | 101766.0 | 4.395987 | 2.985108 | 8.910868 | 1.0 | 2.0 | 4.0 | 6.0 | 14.0 | 13 | 3 | 0.850251 | 1.133999 |
| 1 | num_lab_procedures | 101766.0 | 43.095641 | 19.674362 | 387.080530 | 1.0 | 31.0 | 44.0 | 57.0 | 132.0 | 131 | 1 | -0.245074 | -0.236544 |
| 2 | num_procedures | 101766.0 | 1.339730 | 1.705807 | 2.909777 | 0.0 | 0.0 | 1.0 | 2.0 | 6.0 | 6 | 0 | 0.857110 | 1.316415 |
| 3 | num_medications | 101766.0 | 16.021844 | 8.127566 | 66.057332 | 1.0 | 10.0 | 15.0 | 20.0 | 81.0 | 80 | 13 | 3.468155 | 1.326672 |
| 4 | number_outpatient | 101766.0 | 0.369357 | 1.267265 | 1.605961 | 0.0 | 0.0 | 0.0 | 0.0 | 42.0 | 42 | 0 | 147.907736 | 8.832959 |
| 5 | number_emergency | 101766.0 | 0.197836 | 0.930472 | 0.865779 | 0.0 | 0.0 | 0.0 | 0.0 | 76.0 | 76 | 0 | 1191.686726 | 22.855582 |
| 6 | number_inpatient | 101766.0 | 0.635566 | 1.262863 | 1.594824 | 0.0 | 0.0 | 0.0 | 1.0 | 21.0 | 21 | 0 | 20.719397 | 3.614139 |
| 7 | number_diagnoses | 101766.0 | 7.422607 | 1.933600 | 3.738810 | 1.0 | 6.0 | 8.0 | 9.0 | 16.0 | 15 | 9 | -0.079056 | -0.876746 |
We computed the following summary statistics for the numerical variables: count, mean, standard deviation, variance, five-number summary (minimum and maximum, 25th and 75th percentiles, and median), range, and mode. These tell us how many records there are and give us a sense of the distribution of the data. We have the measures of center (mean, median, and mode) as well as measures of spread (standard deviation and variance). We can also see the full range of the data and where the middle 50% of the data lie.
For many of these features the mean and median (and often the mode) are similar values. The distributions appear to be right-skewed in many cases by high outliers. The exception seems to be the number of lab procedures, which appears more normally distributed.
Additionally, in combination with skewness, several variables have a max-to-min ratio much greater than 10 (e.g. num_lab_procedures, num_medications, number_outpatient, number_emergency, number_inpatient), suggesting that a log transformation may be helpful.
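As a quick illustration of why log1p is a candidate here, the sketch below applies it to a synthetic zero-inflated count variable (a made-up stand-in for a feature like number_outpatient, not the real data) and compares skewness before and after:

```python
import numpy as np
import pandas as pd

# Synthetic zero-inflated counts standing in for a feature like `number_outpatient`
rng = np.random.default_rng(42)
counts = pd.Series(rng.poisson(lam=0.4, size=10_000) * rng.geometric(p=0.5, size=10_000))

# log1p = log(x + 1), so the many zeros map to 0 instead of -inf
print("skew before:", round(counts.skew(), 2))
print("skew after :", round(np.log1p(counts).skew(), 2))
```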
For the categorical variables (including the target), we computed the following summary statistics: count, number of unique values (levels), Top (the level that occurs most frequently), Freq (the number of occurrences of the top level), and the percentage of values within that level.
This indicates that for many features, nearly all of the values fall within one category (e.g. over 99% of patients do not take 15 out of the 23 drugs in the dataset). We also can see that there are repeated patients in the dataset.
3.3.2 Categorical features and target variable¶
# Summary statistics of the categorical features
# Calculate basic statistics
desc_cat = df_clean.describe(include=['object', 'category']) # Includes count, unique, top, freq
top_percentage = (desc_cat.loc['freq'] / desc_cat.loc['count'] * 100).round(2)
# Create a summary DataFrame
cat_stats_df = pd.DataFrame({
'Count': desc_cat.loc['count'],
'Unique': desc_cat.loc['unique'],
'Top': desc_cat.loc['top'],
'Freq': desc_cat.loc['freq'],
'Top %': top_percentage
})
# Rearrange and clean up for better presentation
cat_stats_df.reset_index(inplace=True) # Optional: Keep variable names as a column
cat_stats_df.rename(columns={'index': 'Variable'}, inplace=True)
# Display the final DataFrame
cat_stats_df
| Variable | Count | Unique | Top | Freq | Top % | |
|---|---|---|---|---|---|---|
| 0 | patient_nbr | 101766 | 71518 | 88785891 | 40 | 0.039306 |
| 1 | race | 101766 | 6 | Caucasian | 76099 | 74.778413 |
| 2 | gender | 101766 | 3 | Female | 54708 | 53.758623 |
| 3 | age | 101766 | 10 | [70-80) | 26068 | 25.615628 |
| 4 | weight | 3197 | 9 | [75-100) | 1336 | 41.789177 |
| 5 | admission_type_id | 101766 | 8 | 1 | 53990 | 53.053083 |
| 6 | discharge_disposition_id | 101766 | 26 | 1 | 60234 | 59.188727 |
| 7 | admission_source_id | 101766 | 17 | 7 | 57494 | 56.496276 |
| 8 | payer_code | 101766 | 18 | Unknown | 40256 | 39.557416 |
| 9 | medical_specialty | 101766 | 73 | Unknown | 49949 | 49.082208 |
| 10 | diag_1 | 101766 | 717 | 428 | 6862 | 6.74292 |
| 11 | diag_2 | 101766 | 749 | 276 | 6752 | 6.634829 |
| 12 | diag_3 | 101766 | 790 | 250 | 11555 | 11.35448 |
| 13 | max_glu_serum | 101766 | 4 | Untested | 96420 | 94.746772 |
| 14 | A1Cresult | 101766 | 4 | Untested | 84748 | 83.277322 |
| 15 | metformin | 101766 | 4 | No | 81778 | 80.358862 |
| 16 | repaglinide | 101766 | 4 | No | 100227 | 98.487707 |
| 17 | nateglinide | 101766 | 4 | No | 101063 | 99.3092 |
| 18 | chlorpropamide | 101766 | 4 | No | 101680 | 99.915492 |
| 19 | glimepiride | 101766 | 4 | No | 96575 | 94.899082 |
| 20 | acetohexamide | 101766 | 2 | No | 101765 | 99.999017 |
| 21 | glipizide | 101766 | 4 | No | 89080 | 87.534147 |
| 22 | glyburide | 101766 | 4 | No | 91116 | 89.534815 |
| 23 | tolbutamide | 101766 | 2 | No | 101743 | 99.977399 |
| 24 | pioglitazone | 101766 | 4 | No | 94438 | 92.799167 |
| 25 | rosiglitazone | 101766 | 4 | No | 95401 | 93.745455 |
| 26 | acarbose | 101766 | 4 | No | 101458 | 99.697345 |
| 27 | miglitol | 101766 | 4 | No | 101728 | 99.962659 |
| 28 | troglitazone | 101766 | 2 | No | 101763 | 99.997052 |
| 29 | tolazamide | 101766 | 3 | No | 101727 | 99.961677 |
| 30 | examide | 101766 | 1 | No | 101766 | 100.0 |
| 31 | citoglipton | 101766 | 1 | No | 101766 | 100.0 |
| 32 | insulin | 101766 | 4 | No | 47383 | 46.560737 |
| 33 | glyburide-metformin | 101766 | 4 | No | 101060 | 99.306252 |
| 34 | glipizide-metformin | 101766 | 2 | No | 101753 | 99.987226 |
| 35 | glimepiride-pioglitazone | 101766 | 2 | No | 101765 | 99.999017 |
| 36 | metformin-rosiglitazone | 101766 | 2 | No | 101764 | 99.998035 |
| 37 | metformin-pioglitazone | 101766 | 2 | No | 101765 | 99.999017 |
| 38 | change | 101766 | 2 | No | 54755 | 53.804807 |
| 39 | diabetesMed | 101766 | 2 | Yes | 78363 | 77.003125 |
| 40 | readmitted | 101766 | 3 | NO | 54864 | 53.911916 |
# Find the proportion of each response class.
print(df_clean['readmitted'].value_counts()/df_clean['readmitted'].count())
readmitted NO 0.539119 >30 0.349282 <30 0.111599 Name: count, dtype: float64
The response class is imbalanced: 54% of patient records were not readmitted, 35% were readmitted after more than 30 days, and 11% were readmitted within 30 days.
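If this imbalance becomes a problem during modeling, one option is class weighting. A minimal sketch using scikit-learn's compute_class_weight on hypothetical labels that mirror the observed proportions (not the actual dataset):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels mirroring the observed split (54% NO, 35% >30, 11% <30)
y = np.array(["NO"] * 54 + [">30"] * 35 + ["<30"] * 11)
classes = np.unique(y)  # ['<30', '>30', 'NO']

# 'balanced' weights are inversely proportional to class frequency
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights.round(2))))
```

The rare `<30` class receives the largest weight, which a classifier can use to counteract the majority class.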
# Find how many patients have more than one encounter.
patient_counts = df_clean['patient_nbr'].value_counts(dropna=False)
repeated_patients = patient_counts[patient_counts > 1]
num_repeated_patients = len(repeated_patients)
print(f"Number of patients with more than one record: {num_repeated_patients}")
repeated_patients.min(), repeated_patients.max()
Number of patients with more than one record: 16773
(2, 40)
patient_nbr is not unique for each record in the dataset indicating that there are repeated measures. 16,773 patients have between 2 and 40 encounters.
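If we later need statistically independent observations, one common mitigation for repeated measures is to keep a single encounter per patient. A sketch on a toy frame (the encounter and patient numbers are made up):

```python
import pandas as pd

# Toy frame with repeated patient_nbr values (encounter ids are made up)
enc = pd.DataFrame({
    "encounter_id": [10, 11, 12, 13, 14],
    "patient_nbr":  [1, 1, 2, 3, 3],
})

# Keep only each patient's first recorded encounter
first_only = enc.sort_values("encounter_id").drop_duplicates("patient_nbr", keep="first")
print(first_only)
```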
# Verify that values in 'examide' and 'citoglipton' are 100% 'No' across the full dataset
print(df_clean['examide'].value_counts())
print(df_clean['citoglipton'].value_counts())
examide No 101766 Name: count, dtype: int64 citoglipton No 101766 Name: count, dtype: int64
We should be able to eliminate these variables, since they aren't offering any insight.
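Rather than checking columns one at a time, zero-variance columns can also be detected programmatically. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame: an `examide`-style constant column next to a varying one
toy = pd.DataFrame({
    "examide": ["No", "No", "No"],
    "insulin": ["No", "Up", "Steady"],
})

# A column with a single unique value has zero variance and no predictive signal
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)  # ['examide']
```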
3.3.3 Remove zero variance and high missing variables¶
# Remove examide and citoglipton from the dataset, since they have zero variance.
df_clean = df_clean.drop(columns=['examide', 'citoglipton'])
# Remove weight as well, since roughly 97% of its values are missing.
df_clean = df_clean.drop(columns=['weight'])
print( df_clean.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 46 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 patient_nbr 101766 non-null category 1 race 101766 non-null object 2 gender 101766 non-null object 3 age 101766 non-null object 4 admission_type_id 101766 non-null category 5 discharge_disposition_id 101766 non-null category 6 admission_source_id 101766 non-null category 7 time_in_hospital 101766 non-null int64 8 payer_code 101766 non-null object 9 medical_specialty 101766 non-null object 10 num_lab_procedures 101766 non-null int64 11 num_procedures 101766 non-null int64 12 num_medications 101766 non-null int64 13 number_outpatient 101766 non-null int64 14 number_emergency 101766 non-null int64 15 number_inpatient 101766 non-null int64 16 diag_1 101766 non-null object 17 diag_2 101766 non-null object 18 diag_3 101766 non-null object 19 number_diagnoses 101766 non-null int64 20 max_glu_serum 101766 non-null object 21 A1Cresult 101766 non-null object 22 metformin 101766 non-null object 23 repaglinide 101766 non-null object 24 nateglinide 101766 non-null object 25 chlorpropamide 101766 non-null object 26 glimepiride 101766 non-null object 27 acetohexamide 101766 non-null object 28 glipizide 101766 non-null object 29 glyburide 101766 non-null object 30 tolbutamide 101766 non-null object 31 pioglitazone 101766 non-null object 32 rosiglitazone 101766 non-null object 33 acarbose 101766 non-null object 34 miglitol 101766 non-null object 35 troglitazone 101766 non-null object 36 tolazamide 101766 non-null object 37 insulin 101766 non-null object 38 glyburide-metformin 101766 non-null object 39 glipizide-metformin 101766 non-null object 40 glimepiride-pioglitazone 101766 non-null object 41 metformin-rosiglitazone 101766 non-null object 42 metformin-pioglitazone 101766 non-null object 43 change 101766 non-null object 44 diabetesMed 101766 non-null object 45 readmitted 101766 non-null object dtypes: 
category(4), int64(8), object(34) memory usage: 35.9+ MB None
3.3.4 Outliers¶
# Calculate IQR and outlier bounds
num_stats_df['IQR'] = num_stats_df['75%'] - num_stats_df['25%'] # Interquartile Range
num_stats_df['Lower_IQR_Bound'] = num_stats_df['25%'] - 1.5 * num_stats_df['IQR']
num_stats_df['Upper_IQR_Bound'] = num_stats_df['75%'] + 1.5 * num_stats_df['IQR']
outliers = {} # Empty dictionary to store outliers for each variable
total_rows = len(df_clean)
for var in df_clean.select_dtypes(include='number').columns:
# Get lower and upper bounds for the variable
lower_bound = num_stats_df.loc[num_stats_df['Variable'] == var, 'Lower_IQR_Bound'].values[0]
upper_bound = num_stats_df.loc[num_stats_df['Variable'] == var, 'Upper_IQR_Bound'].values[0]
# Find outliers
outliers[var] = df_clean[(df_clean[var] < lower_bound) | (df_clean[var] > upper_bound)][var]
# Summarize outliers
outlier_counts = {var: len(outlier_values) for var, outlier_values in outliers.items()}
outlier_proportions = {var: round((count / total_rows) * 100, 2) for var, count in outlier_counts.items()} # Convert to percentage and round
# Create a summary DataFrame
outliers_summary_df = pd.DataFrame({
'Variable': outlier_counts.keys(),
'Outlier Count': outlier_counts.values(),
'Outlier Proportion (%)': outlier_proportions.values() # Updated to percentage
})
outliers_summary_df
| Variable | Outlier Count | Outlier Proportion (%) | |
|---|---|---|---|
| 0 | time_in_hospital | 2252 | 2.21 |
| 1 | num_lab_procedures | 143 | 0.14 |
| 2 | num_procedures | 4954 | 4.87 |
| 3 | num_medications | 2557 | 2.51 |
| 4 | number_outpatient | 16739 | 16.45 |
| 5 | number_emergency | 11383 | 11.19 |
| 6 | number_inpatient | 7049 | 6.93 |
| 7 | number_diagnoses | 281 | 0.28 |
Suspected outliers were identified using the interquartile range (IQR) method: values below the 25th percentile minus 1.5 times the IQR or above the 75th percentile plus 1.5 times the IQR were classified as outliers. Variables related to hospital visits (e.g. number_outpatient, number_emergency, number_inpatient) exhibited a higher proportion of outliers, with more than 5% of observations flagged. These variables had low mean, median, and mode values, but a small subset of patients had disproportionately high numbers of visits.
Other numeric variables had fewer than 5% of observations defined as outliers. It may be worth examining potential correlations among the hospital visit variables, as certain patients could be more prone to frequent hospital visits and associated interventions.
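To follow up on the possible correlation among the visit variables, a rank-based (Spearman) correlation is less distorted by the heavy skew seen above than Pearson. A sketch on synthetic counts that share a common "frequent visitor" component (the data-generating assumptions are illustrative only, not estimated from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.poisson(lam=0.5, size=5_000)  # shared "frequent visitor" tendency
visits = pd.DataFrame({
    "number_outpatient": base + rng.poisson(0.2, 5_000),
    "number_emergency":  base + rng.poisson(0.2, 5_000),
    "number_inpatient":  base + rng.poisson(0.2, 5_000),
})

# Spearman (rank) correlation is more robust to the extreme outliers than Pearson
print(visits.corr(method="spearman").round(2))
```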
def create_variable_grid(n_variables, n_cols=3, figsize=(16, 4)):
"""
Creates a dynamic grid layout for subplots.
Parameters:
- n_variables: Total number of variables to visualize.
- n_cols: Number of columns in the subplot grid.
- figsize: Tuple representing the figure size per row.
Returns:
- fig, axes: Matplotlib figure and flattened axes array.
"""
# Calculate the number of rows needed
n_rows = (n_variables + n_cols - 1) // n_cols # Ceiling division
# Create subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(figsize[0], figsize[1] * n_rows))
axes = axes.flatten() # Flatten axes for easier iteration
# Turn off unused subplots
for ax in axes[n_variables:]:
ax.axis("off") # Hide unused axes
return fig, axes
import scipy.stats as stats
numerical_df = df_clean.select_dtypes(include='number').copy()  # copy so columns can be added later without chained-assignment warnings
def improved_qqplot(series, axes, column):
    """
    Q-Q plot against the normal distribution, annotated with skewness and kurtosis.
    """
    # Perform Q-Q plot
    stats.probplot(series, dist="norm", plot=axes)
    # Add title and labels
    axes.set_title(f"Q-Q: {column}", fontsize=12)
    axes.set_xlabel("Theoretical Quantiles", fontsize=10)
    axes.set_ylabel("Sample Quantiles", fontsize=10)
    # Annotate summary statistics
    skewness = stats.skew(series, nan_policy="omit")
    kurtosis = stats.kurtosis(series, nan_policy="omit")
    axes.text(0.05, 0.9, f"Skewness: {skewness:.2f}\nKurtosis: {kurtosis:.2f}",
              transform=axes.transAxes, fontsize=10, bbox=dict(facecolor='white', alpha=0.5))
fig, axes = create_variable_grid(len(numerical_df.columns))
for i, column in enumerate(numerical_df.columns):
    improved_qqplot(numerical_df[column].dropna(), axes[i], column)
plt.tight_layout()
# plt.savefig(f'plots/qqPlots.png')
plt.show()
Q-Q Plots:
- time_in_hospital:
- Skewness: 1.13: Indicating a moderate right-skew, suggesting values tend to concentrate on the left side of the distribution.
- Kurtosis: 0.85: The positive excess kurtosis indicates that the distribution is slightly more peaked, with slightly heavier tails, than a normal distribution.
- Overall: The distribution is slightly asymmetric with a right skew and mildly heavy tails.
- num_lab_procedures:
- Skewness: -0.24: A skewness of -0.24 indicates a slight left skew; but it's very close to zero, suggesting near symmetry.
- Kurtosis: -0.25: The kurtosis indicates that the distribution is slightly flatter than a normal distribution.
- Overall: The distribution is nearly symmetric with a slight left skew and flat tails, indicating a lack of extreme outliers.
- num_procedures:
- Skewness: 1.32: The positive skewness indicates a moderate right skew, with a tail extending towards higher values.
- Kurtosis: 0.86: Slightly heavy tails and a sharper peak than a normal distribution.
- Overall: The distribution is asymmetric with a noticeable right skew, indicating a concentration of values toward the lower end.
- num_medications:
- Skewness: 1.33: Moderately positively skewed, indicating a right tail in the distribution
- Kurtosis: 3.47: High kurtosis, suggesting the presence of more extreme outliers than a normal distribution
- Overall: The distribution is asymmetric with a significant right skew and heavy tails, indicating a concentration of lower values with some extreme high values.
- number_outpatient:
- Skewness: 8.83: Very high skewness (8.83), indicating a significant right skew. The data has many low values and a few extreme high values.
- Kurtosis: 147.90: The extremely high kurtosis (147.90) suggests that the distribution is heavily tailed, with many extreme outliers.
- Overall: The distribution is highly asymmetric, dominated by a heavy right tail and extreme outliers, with most values concentrated near zero.
- number_emergency:
- Skewness: 22.86: The skewness value of 22.86 is extraordinarily high, indicating a severe right skew. This suggests the presence of extreme values far above the rest of the data.
- Kurtosis: 1191.69: The kurtosis is extremely high, indicating that the distribution has very heavy tails and many extreme outliers.
- Overall: The distribution is highly asymmetric, with most values near zero and a small number of extreme high values creating a very heavy right tail.
- number_inpatient:
- Skewness: 3.61: This indicates a strong right skew.
- Kurtosis: 20.72: The kurtosis is very high, suggesting that the distribution has heavy tails with extreme outliers.
- Overall: The distribution is asymmetric, with a noticeable right skew and heavy tails caused by extreme high values.
- number_diagnoses:
- Skewness: -0.88: Moderately negatively skewed, indicating a left tail in the distribution
- Kurtosis: -0.08: Nearly normal kurtosis, suggesting a shape close to a normal distribution
- Overall: The distribution is slightly asymmetric with a left skew and relatively normal tails, indicating a mild concentration of lower values.
Let's identify and plot the outliers using the IQR.
def plot_outliers_iqr(numerical_df, multiplier=1.5, axes=None):
"""
Identifies and visualizes outliers in numerical columns using the IQR method.
Parameters:
numerical_df (pd.DataFrame): DataFrame with numerical columns.
multiplier (float): The multiplier for the IQR to define outliers.
axes (list): List of Matplotlib axes to plot on.
Returns:
None
"""
for i, column in enumerate(numerical_df.columns):
Q1 = numerical_df[column].quantile(0.25)
Q3 = numerical_df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - multiplier * IQR
upper_bound = Q3 + multiplier * IQR
outliers = numerical_df[(numerical_df[column] < lower_bound) | (numerical_df[column] > upper_bound)]
sns.scatterplot(data=numerical_df, x=numerical_df.index, y=column, ax=axes[i], color='blue', label='Data')
if not outliers.empty:
sns.scatterplot(data=outliers, x=outliers.index, y=column, ax=axes[i], color='red', label='Outliers')
axes[i].set_title(f'Outliers in {column} (IQR Multiplier > {multiplier})')
    # Hide any unused axes
    for j in range(i + 1, len(axes)):
        axes[j].axis("off")
fig, axes = create_variable_grid(len(numerical_df.columns))
plot_outliers_iqr(numerical_df, axes=axes)
plt.tight_layout()
# plt.savefig(f'plots/outlierPlots.png')
plt.show()
Outlier Plots:
- time_in_hospital:
- Most values fall within a small range, with outliers concentrated at higher durations.
- The distribution has visible horizontal bands due to discrete values.
- num_lab_procedures:
- A high concentration of values in the lower range, with outliers extending well beyond 100.
- Outliers are sparsely distributed, representing extreme values in lab procedures.
- num_procedures:
- The data is heavily banded due to its discrete nature, with very few outliers visible.
- Most values fall within the range of 0–5.
- num_medications:
- A noticeable set of points exceeds the IQR threshold (about 2.5% of observations are flagged).
- Outliers dominate the upper range (values >30).
- number_outpatient:
- Many outliers are visible, especially in the upper range, with values exceeding 40.
- The majority of data points remain concentrated near zero.
- number_emergency:
- The data is skewed toward zero, with a large number of extreme outliers above the IQR threshold.
- Outliers are well-separated, forming a sparse tail.
- number_inpatient:
- Similar to outpatient visits, most values are close to zero, with a growing number of outliers at higher inpatient counts.
- Outliers appear spread out in the upper range.
- number_diagnoses:
- Most values are clustered between 1 and 9, with a few significant outliers around 16.
- Horizontal banding is due to discrete counts.
Summary of Outliers:
Based on the QQ plots, most variables deviate from normality, with skewed distributions and heavy tails. Several variables have a considerable number of extreme values, especially number_emergency, number_outpatient, and num_medications. These outliers dominate the upper ranges and might bias our analysis if no action is taken.
Outlier Handling: For highly skewed variables with significant outliers, we may need to consider capping/flooring or removing extreme outliers, though we'll need to assess whether this compromises the validity of our models. Additionally, we may be able to normalize/transform some of these values and extract some insights from them. For the discrete variables, we may not be able to transform them, as they likely reflect real-world observations; however, we may consider grouping rare values to simplify the data slightly. Finally, a potential solution could be engineering new features from the outliers (e.g., flagging hyper-critical emergency visits or highly medicated individuals).
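As one concrete option among those listed, the sketch below caps a variable at its upper IQR fence while recording an indicator flag, so the information that a value was extreme is preserved as a feature (cap_with_flag is a hypothetical helper, not part of our pipeline):

```python
import pandas as pd

def cap_with_flag(s: pd.Series, multiplier: float = 1.5):
    """Cap a series at its upper IQR fence and return an outlier indicator."""
    q1, q3 = s.quantile([0.25, 0.75])
    upper = q3 + multiplier * (q3 - q1)
    flag = (s > upper).astype(int)  # 1 marks a value that was capped
    return s.clip(upper=upper), flag

# Toy example: a mostly-zero count feature with one extreme value
s = pd.Series([0, 0, 0, 1, 2, 40])
capped, flag = cap_with_flag(s)
print(capped.tolist(), flag.tolist())
```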
3.3.5 Data preprocessing: Explicitly set the order of ordinal variables¶
# Define the correct order for each ordinal variable
readmit_order = ['<30', '>30', 'NO']
drug_order = ['No', 'Down', 'Steady', 'Up']
max_glu_serum_order = ['Untested', 'Norm', '>200', '>300']
a1cresult_order = ['Untested', 'Norm', '>7', '>8']
age_order = ['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)',
'[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)']
# weight_order = ['[0-25)', '[25-50)', '[50-75)', '[75-100)', '[100-125)',
#                 '[125-150)', '[150-175)', '[175-200)', '>200'] # dropped this column
# List of drug-related variables
drug_columns = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'tolazamide',
'pioglitazone', 'rosiglitazone', 'troglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'glipizide-metformin',
'metformin-rosiglitazone', 'metformin-pioglitazone', 'glimepiride-pioglitazone']
# Reorder categories in the DataFrame
df_clean['readmitted'] = pd.Categorical(df_clean['readmitted'], categories=readmit_order, ordered=True)
df_clean['max_glu_serum'] = pd.Categorical(df_clean['max_glu_serum'], categories=max_glu_serum_order, ordered=True)
df_clean['A1Cresult'] = pd.Categorical(df_clean['A1Cresult'], categories=a1cresult_order, ordered=True)
df_clean['age'] = pd.Categorical(df_clean['age'], categories=age_order, ordered=True)
# df_clean['weight'] = pd.Categorical(df_clean['weight'], categories=weight_order, ordered=True)
for col in drug_columns:
if col in df_clean.columns:
df_clean[col] = pd.Categorical(df_clean[col], categories=drug_order, ordered=True)
# Set the color palette
palette = sns.color_palette("muted", n_colors=len(readmit_order))
# Calculate counts and proportions of each category
readmit_counts = df_clean['readmitted'].value_counts(sort=False) # Counts
readmit_proportions = readmit_counts / readmit_counts.sum() # Proportions
# Plot the distribution
fig, ax = plt.subplots()
sns.barplot(
x=readmit_proportions.index,
y=readmit_proportions.values,
hue=readmit_proportions.index,
palette=palette,
ax=ax
)
# Add titles and labels
ax.set_title('Distribution of Readmitted Categories')
ax.set_xlabel('Readmitted')
ax.set_ylabel('Proportion')
# Add counts as labels on top of each bar
for bar, count in zip(ax.patches, readmit_counts.values):
ax.text(
bar.get_x() + bar.get_width() / 2, # x-coordinate
bar.get_height(), # y-coordinate (bar height)
f'{count}', # Text label
ha='center', va='bottom', # Center alignment
fontsize=10
)
plt.tight_layout()
# plt.savefig(f'plots/barplot_readmitted.png', bbox_inches="tight")
plt.show()
3.4.2 Numerical features¶
This plots the distribution of each numerical variable split by response class. A histogram (or kernel density plot) and boxplots are plotted to get a visual sense of each variable.
# List of numeric columns
numeric_cols = df_clean.select_dtypes(include=['int64', 'float64']).columns
# Set the color palette
palette = sns.color_palette("muted", n_colors=len(readmit_order))
# Generate plots for each numeric column
for col in numeric_cols:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
# Histogram with proportions
for idx, category in enumerate(readmit_order):
subset = df_clean[df_clean['readmitted'] == category]
sns.histplot(subset[col], bins=20, color=palette[idx], label=category, alpha=0.5, stat="probability", ax=axes[0])
axes[0].set_title(f"Histogram of {col} (Proportions) by Response Class")
axes[0].set_xlabel(col)
axes[0].set_ylabel('Proportion')
axes[0].legend(title='Readmitted')
# OR Kernel Density Plot
# for idx, category in enumerate(readmit_order):
# subset = df_clean[df_clean['readmitted'] == category]
# sns.kdeplot(subset[col], fill=True, color=palette[idx], label=category, alpha=0.5, ax=axes[0])
# axes[0].set_title(f"Kernel Density Plot of {col} by Response Class")
# axes[0].set_xlabel(col)
# axes[0].set_ylabel('Density')
# axes[0].legend(title='Readmitted')
# Boxplot
sns.boxplot(data=df_clean, x='readmitted', y=col, hue='readmitted', palette=palette[:len(readmit_order)], ax=axes[1])
axes[1].set_title(f"Boxplot of {col} by Response Class")
axes[1].set_xlabel('Readmitted')
axes[1].set_ylabel(col)
plt.tight_layout()
# plt.savefig(f'plots/plot_{col}_by_readmitted.png', bbox_inches="tight")
plt.show()
Based on the skewness and the large max-to-min ratios of several variables, it is worth examining how the distributions look after a log transformation.
# Based on the summary statistics, let's look at the log transformed distributions of a few variables.
df_transform = df_clean.copy()
# List of numeric columns
transform_cols = ['num_lab_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient']
# Generate plots for each numeric column
for col in transform_cols:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
# Histogram with log transformed proportions
for idx, category in enumerate(readmit_order):
subset = df_transform[df_transform['readmitted'] == category]
sns.histplot(np.log1p(subset[col]), bins=20, color=palette[idx], label=category, alpha=0.5, stat="probability", ax=axes[0])
axes[0].set_title(f'Histogram of Log {col} (Proportions) by Response Class')
axes[0].set_xlabel(f'log({col} + 1)')
axes[0].set_ylabel('Proportion')
axes[0].legend(title='Readmitted')
# Histogram with untransformed proportions
for idx, category in enumerate(readmit_order):
subset = df_transform[df_transform['readmitted'] == category]
sns.histplot(subset[col], bins=20, color=palette[idx], label=category, alpha=0.5, stat="probability", ax=axes[1])
axes[1].set_title(f"Histogram of {col} (Proportions) by Response Class")
axes[1].set_xlabel(col)
axes[1].set_ylabel('Proportion')
axes[1].legend(title='Readmitted')
# OR Kernel Density Plot
# for idx, category in enumerate(readmit_order):
# subset = df_transform [df_transform ['readmitted'] == category]
# sns.kdeplot(np.log1p(subset[col]), fill=True, color=palette[idx], label=category, alpha=0.5, ax=axes[0])
# axes[0].set_title(f'Kernel Density Plot of log({col}) by Response Class')
# axes[0].set_xlabel(f'log({col} + 1)')
# axes[0].set_ylabel('Density')
# axes[0].legend(title='Readmitted')
# # Boxplot
# df_transform[f'log_{col}'] = np.log1p(df_transform[col])
# sns.boxplot(data=df_transform, x='readmitted', y=f'log_{col}', hue='readmitted', palette=palette[:len(readmit_order)], ax=axes[1])
# axes[1].set_title(f'Boxplot of log({col}) by Response Class')
# axes[1].set_xlabel('Readmitted')
# axes[1].set_ylabel(f'log({col} + 1)')
plt.tight_layout()
# plt.savefig(f'plots/plot_log({col})_by_readmitted.png', bbox_inches="tight")
plt.show()
Log transformation doesn't seem to dramatically change or normalize the distributions, with the possible exception of num_medications.
# This is to get a sense of the importance of a variable in target class separation.
# Group numerical variables by `readmitted` and view some summary stats
# Split the numerical columns into two groups
split_point = len(numeric_cols) // 2
numeric_cols_part1 = numeric_cols[:split_point]
numeric_cols_part2 = numeric_cols[split_point:]
# Group numerical variables (first half) by `readmitted` and calculate summary stats
aggregated_numstats1 = (
df_clean.groupby('readmitted', observed=False)[numeric_cols_part1]
.agg(['mean', 'std', 'median'])
.loc[['<30', '>30', 'NO']] # Reorder rows for readability
)
# Group numerical variables (second half) by `readmitted` and calculate summary stats
aggregated_numstats2 = (
df_clean.groupby('readmitted', observed=False)[numeric_cols_part2]
.agg(['mean', 'std', 'median'])
.loc[['<30', '>30', 'NO']]
)
print("Part 1: Summary statistics for the first group of numerical variables")
aggregated_numstats1
Part 1: Summary statistics for the first group of numerical variables
| time_in_hospital | num_lab_procedures | num_procedures | num_medications | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | std | median | mean | std | median | mean | std | median | mean | std | median | |
| readmitted | ||||||||||||
| <30 | 4.768249 | 3.028165 | 4.0 | 44.226028 | 19.276087 | 45.0 | 1.280884 | 1.635992 | 1.0 | 16.903143 | 8.096696 | 16.0 |
| >30 | 4.495541 | 2.988064 | 4.0 | 43.836601 | 19.567515 | 45.0 | 1.249599 | 1.669536 | 1.0 | 16.282768 | 7.618829 | 15.0 |
| NO | 4.254429 | 2.964964 | 3.0 | 42.381598 | 19.796262 | 44.0 | 1.410305 | 1.739693 | 1.0 | 15.670367 | 8.427628 | 14.0 |
print("\nPart 2: Summary statistics for the second group of numerical variables")
aggregated_numstats2
Part 2: Summary statistics for the second group of numerical variables
| number_outpatient | number_emergency | number_inpatient | number_diagnoses | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | std | median | mean | std | median | mean | std | median | mean | std | median | |
| readmitted | ||||||||||||
| <30 | 0.436911 | 1.302788 | 0.0 | 0.357313 | 1.370384 | 0.0 | 1.224003 | 1.954577 | 0.0 | 7.692789 | 1.773477 | 9.0 |
| >30 | 0.496329 | 1.544047 | 0.0 | 0.283669 | 1.194989 | 0.0 | 0.838993 | 1.393265 | 0.0 | 7.646898 | 1.812078 | 9.0 |
| NO | 0.273112 | 1.030704 | 0.0 | 0.109216 | 0.523609 | 0.0 | 0.381963 | 0.864301 | 0.0 | 7.221366 | 2.017054 | 8.0 |
numerical_df['readmitted'] = df_clean['readmitted']
# Aggregated Metrics Heatmap with Swapped Axes
def plot_aggregated_metrics_heatmap(df, numerical_columns, response_column, metric='mean'):
# Aggregate the data by the response column
agg_df = df.groupby(response_column, observed=False)[numerical_columns].agg(metric)
# Transpose the DataFrame to swap x and y axes
agg_df = agg_df.T
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
agg_df,
annot=True,
fmt=".3f",
cmap="coolwarm",
linewidths=0.5
)
plt.title(f'{metric.capitalize()} Values of Numeric Variables by {response_column}', fontsize=16)
plt.xlabel(response_column) # Swapped
plt.ylabel('Numeric Variables') # Swapped
plt.tight_layout()
# plt.savefig(f'plots/sumStats_heatmaps_mean.png')
plt.show()
# Example Usage
plot_aggregated_metrics_heatmap(
df=numerical_df,
numerical_columns=numerical_df.columns[:-1],
response_column='readmitted',
metric='mean'
)
# Reuse the same function for the standard deviation and median heatmaps
plot_aggregated_metrics_heatmap(
df=numerical_df,
numerical_columns=numerical_df.columns[:-1],
response_column='readmitted',
metric='std'
)
plot_aggregated_metrics_heatmap(
df=numerical_df,
numerical_columns=numerical_df.columns[:-1],
response_column='readmitted',
metric='median'
)
- There isn't obvious class separation in the numerical variables. However, the graphs and summary statistics do suggest some differences in the following features:
- time_in_hospital
- num_lab_procedures
- num_medications
- number_emergency
- number_inpatient
- number_diagnoses
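To put rough numbers on those apparent differences, a standardized mean difference (Cohen's d) between response classes is a quick check. A minimal sketch with made-up values standing in for number_inpatient (not the actual data):

```python
import numpy as np
import pandas as pd

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

# Made-up inpatient-visit counts for two readmission classes
early = pd.Series([2, 1, 3, 0, 2, 1])   # stand-in for '<30'
never = pd.Series([0, 0, 1, 0, 1, 0])   # stand-in for 'NO'
print(f"Cohen's d: {cohens_d(early, never):.2f}")
```

Values near 0 indicate negligible separation; the real columns above would likely show small effects, consistent with the overlapping distributions in the plots.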
One more measure we can use to assess the numeric variables is the Empirical Cumulative Distribution Function (ECDF). An ECDF shows the proportion of observations that are less than or equal to a particular value: the x-axis shows the values of the variable and the y-axis shows the cumulative probability at that value. Each step represents an observation, with the step height indicating the proportion of observations at or below the corresponding x value. This gives a visual summary of the data, including extreme values, central tendency, and spread, and makes it quick to read off what proportion of the data falls below a given threshold.
if 'readmitted' in numerical_df:
del numerical_df['readmitted']
def ecdf(data, axes, column):
"""
ECDF plot with percentile markers, confidence intervals, and summary metrics.
"""
# Sort data and compute ECDF
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)
# Plot ECDF
axes.step(x, y, where="post", color="blue", label="ECDF")
# Highlight percentiles
percentiles = [0.25, 0.5, 0.75]
percentile_values = np.percentile(data, [25, 50, 75])
for p, value in zip(percentiles, percentile_values):
axes.axhline(p, linestyle="--", color="gray", alpha=0.7)
axes.axvline(value, linestyle="--", color="gray", alpha=0.7)
axes.text(value, p, f"{value:.2f}", fontsize=9, verticalalignment="bottom")
# Add confidence intervals
n = len(data)
ci_band = 1.36 / np.sqrt(n) # Approximation for a 95% band
axes.fill_between(x, y - ci_band, y + ci_band, color="gray", alpha=0.2, label="95% CI")
# Annotate summary statistics
mean, median = np.mean(data), np.median(data)
axes.text(0.05, 0.9, f"Mean: {mean:.2f}\nMedian: {median:.2f}",
transform=axes.transAxes, fontsize=10, bbox=dict(facecolor='white', alpha=0.5))
# Add labels and title (column is passed explicitly rather than read from an outer scope)
axes.set_title(f"ECDF: {column}", fontsize=12)
axes.set_xlabel("Value", fontsize=10)
axes.set_ylabel("Cumulative Probability", fontsize=10)
fig, axes = create_variable_grid(len(numerical_df.columns))
for i, column in enumerate(numerical_df.columns):
ecdf(numerical_df[column].dropna(), axes[i], column)
plt.tight_layout()
# plt.savefig(f'plots/ECDFplots.png')
plt.show()
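The 1.36/√n half-width used for the 95% band in the ECDF function comes from the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality; the constant is √(ln(2/α)/2) evaluated at α = 0.05:

```python
import math

def dkw_half_width(n, alpha=0.05):
    """(1 - alpha) confidence band half-width for an ECDF via the DKW inequality."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

# For alpha = 0.05 the band reduces to approximately 1.36 / sqrt(n)
n = 10_000
print(f"{dkw_half_width(n) * math.sqrt(n):.3f}")  # constant ≈ 1.358
```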
Empirical Cumulative Distribution Function:
- time_in_hospital:
- Mean: 4.40, Median: 4.00
- Most patients stay for 2–6 days, with very few staying beyond 10 days.
- The distribution is concentrated around the median, with a small tail extending toward longer durations.
- num_lab_procedures:
- Mean: 43.10, Median: 44.00
- The distribution is symmetric around 40–50 procedures, with most values between 20 and 80.
- Few patients have lab procedures below 20 or above 100.
- num_procedures:
- Mean: 1.34, Median: 1.00
- Most patients have fewer than 2 procedures, with discrete steps indicating a small number of unique values.
- Outliers are visible for values above 4 procedures.
- num_medications:
- Mean: 16.02, Median: 15.00
- The distribution is concentrated between 10 and 30 medications, with a gradual tail extending up to 80.
- A large proportion of patients fall close to the mean.
- number_outpatient:
- Mean: 0.37, Median: 0.00
- The majority of patients (over 90%) have zero outpatient visits.
- A small fraction of the population accounts for higher outpatient numbers, extending to a maximum of ~40.
- number_emergency:
- Mean: 0.20, Median: 0.00
- Similar to outpatient visits, most patients (over 95%) have zero emergency visits.
- Outliers extend to values above 70, though they represent a very small percentage of the dataset.
- number_inpatient:
- Mean: 0.64, Median: 0.00
- Over 80% of patients have no inpatient visits, while the remaining show a gradual distribution up to ~20 visits.
- A significant tail exists for higher values.
- number_diagnoses:
- Mean: 7.42, Median: 8.00
- The distribution is centered around the median, with most patients having between 6 and 9 diagnoses.
- Very few patients have fewer than 4 or more than 12 diagnoses, making the distribution relatively compact.
3.4.3 Categorical features¶
This plots the percentage of each readmitted class within the levels of each categorical variable as bar plots. Only the 12 most frequent levels of each variable are displayed. It helps us see whether and how the distribution of the response classes changes between categorical levels.
# Set the color palette
palette = sns.color_palette("muted", n_colors=len(readmit_order))
def calculate_proportions(df, catvar, response):
"""
Calculate proportions of response classes for each level of a categorical variable.
"""
# Count occurrences of each combination of catvar and response
counts = df.groupby([catvar, response], observed=False).size().reset_index(name='count')
# Calculate proportions within each level of the categorical variable
counts['proportion'] = counts.groupby(catvar, observed=False)['count'].transform(lambda x: x / x.sum())
return counts
def plot_categorical_proportions(df, groups, response, ncols=2, top_n=12, show_empty=True):
"""
Create bar plots of proportions for categorical variables by response class.
"""
for group_name, variables in groups.items():
# Number of rows needed
nrows = -(-len(variables) // ncols) # Ceiling division
# Create subplots
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, nrows * 5))
axes = axes.flatten() # Flatten axes for easy iteration
# Plot each variable in the group
for i, var in enumerate(variables):
if var in df.columns: # Check if variable exists in DataFrame
# Calculate proportions
proportions = calculate_proportions(df, var, response)
# Optionally exclude empty categories
if not show_empty:
# Remove rows with zero counts
proportions = proportions[proportions['count'] > 0]
# Dynamically adjust categories to remove unused ones
if isinstance(proportions[var].dtype, pd.CategoricalDtype):
proportions[var] = proportions[var].cat.remove_unused_categories()
# Limit to top_n categories by total count
top_categories = proportions.groupby(var, observed=False)['count'].sum().nlargest(top_n).index
plot_data = proportions[proportions[var].isin(top_categories)]
# Bar plot
sns.barplot(
data=plot_data,
x=var,
y='proportion',
hue=response,
palette=palette,
ax=axes[i]
)
axes[i].set_title(f'{group_name}: {var}', fontsize=12)
axes[i].set_ylabel('Proportion')
axes[i].set_xlabel(var)
axes[i].tick_params(axis='x', rotation=45)
axes[i].legend(title=response, loc='upper right')
else:
axes[i].set_visible(False) # Hide empty plots
# Hide unused axes
for j in range(len(variables), len(axes)):
axes[j].set_visible(False)
# Adjust layout
plt.tight_layout()
plt.suptitle(f'Group: {group_name}', fontsize=16, y=1.02) # Add a super title
# plt.savefig(f'plots/plot_{group_name}_by_readmitted.png', bbox_inches="tight")
plt.show()
# Grouping categorical variables for plotting
groups = {
"Demographics": ['race', 'gender', 'age'], # excluded patient_nbr (and weight, which was dropped)
"Administrative": ['admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'payer_code', 'medical_specialty'],
"Diagnoses": ['diag_1', 'diag_2', 'diag_3'],
"Tests": ['max_glu_serum', 'A1Cresult'],
"FirstLineDrugs": ['insulin', 'metformin'],
"Meglitinides": ['repaglinide', 'nateglinide'],
"Sulfonylureas": ['chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'tolazamide'],
"TZDs": ['pioglitazone', 'rosiglitazone', 'troglitazone'],
"AGIs": ['acarbose', 'miglitol'],
"Combos": ['glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'],
"Treatment": ['change', 'diabetesMed']
}
# Call the function to create bar plots
plot_categorical_proportions(df_clean, groups, response='readmitted', ncols=2, top_n=12, show_empty=False)
- Comparing the trends between the response classes across categorical levels, there are no stand-out examples of class separation. We also need to bear in mind that many of these features have a large imbalance across their levels (e.g. many of the drug variables). However, there are some differences where a feature level may be associated with the response class. Examples include:
- race
- admission_type_id
- discharge_disposition_id
- admission_source_id
- payer_code
- medical_specialty
- diag_1, diag_2, diag_3
- max_glu_serum
- nearly all the drug variables
- change
- diabetesMed
subset_df = df_clean.select_dtypes(include=['object', 'category'])
subset_df = subset_df.drop(['medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'patient_nbr', 'admission_type_id', 'max_glu_serum', 'A1Cresult'], axis=1)
subset_df.columns
Index(['race', 'gender', 'age', 'discharge_disposition_id',
'admission_source_id', 'payer_code', 'metformin', 'repaglinide',
'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
'insulin', 'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
dtype='object')
response = 'readmitted' # Attribute of interest
other_columns = [col for col in subset_df.columns if col != response] # Exclude target
crosstab_results = {}
for column in other_columns:
crosstab_results[column] = pd.crosstab(subset_df[response], subset_df[column])
def create_heatmaps(crosstab_results, response, normalization='none', n_cols=4, cmap="plasma"):
"""
Creates a grid of heatmaps for given crosstab results.
Parameters:
crosstab_results (dict): Dictionary of crosstabs for each feature.
response (str): Name of the response variable.
normalization (str): Normalization type ('row', 'column', 'overall').
n_cols (int): Number of columns in the subplot grid.
cmap (str): Colormap for the heatmaps.
"""
n_features = len(crosstab_results) # Total features to plot
n_rows = -(-n_features // n_cols) # Compute rows by rounding up
# Create the figure and axes for subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows), sharex=False, sharey=False)
axes = np.ravel(axes) # Flatten axes for consistent 1D handling
# Loop through crosstabs and create heatmaps
for i, (col, crosstab) in enumerate(crosstab_results.items()):
# Normalize the crosstab based on the selected normalization
if normalization == 'none':
normalized_crosstab = crosstab
elif normalization == 'row':
normalized_crosstab = crosstab.div(crosstab.sum(axis=1), axis=0) # Normalize by row
elif normalization == 'column':
normalized_crosstab = crosstab.div(crosstab.sum(axis=0), axis=1) # Normalize by column
elif normalization == 'overall':
normalized_crosstab = crosstab / crosstab.values.sum() # Normalize overall
else:
raise ValueError("Invalid normalization type. Choose 'none', 'row', 'column', or 'overall'.")
# Plot heatmap
sns.heatmap(normalized_crosstab, annot=True, fmt=".2f" if normalization != 'none' else "d", cmap=cmap, cbar=False, ax=axes[i])
axes[i].set_title(f"Crosstab of {response} vs {col}")
axes[i].set_xlabel(col)
axes[i].set_ylabel(response)
# Remove unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Adjust layout for better spacing
plt.tight_layout()
# plt.savefig(f'plots/crossTab_heatmaps.png')
plt.show()
# create_heatmaps(crosstab_results, response='readmitted', normalization='none', n_cols=4, cmap="plasma") # Crosstab Heatmap: No Normalization, Raw Counts
# create_heatmaps(crosstab_results, response='readmitted', normalization='row', n_cols=4, cmap="plasma") # Crosstab Heatmap: Row Normalization, Proportion of Factor Levels within the Response Variable
# create_heatmaps(crosstab_results, response='readmitted', normalization='column', n_cols=4, cmap="plasma") # Crosstab Heatmap: Column Normalization, Proportion of Factor Levels within the Predictor Variable
create_heatmaps(crosstab_results, response='readmitted', normalization='overall', n_cols=3, cmap="plasma") # Crosstab Heatmap: Overall Normalization, Proportion of Total Data
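As an aside, pd.crosstab can produce these normalizations directly through its normalize parameter ('index', 'columns', 'all'), mirroring the manual row/column/overall branches. A toy sketch with made-up values (not the actual data):

```python
import pandas as pd

toy = pd.DataFrame({'change': ['Ch', 'No', 'Ch', 'No', 'Ch', 'No'],
                    'readmitted': ['<30', 'NO', 'NO', 'NO', '>30', '<30']})
# normalize='index' gives within-row proportions, like the 'row' branch
row_norm = pd.crosstab(toy['readmitted'], toy['change'], normalize='index')
print(row_norm)
```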
3.4.4 Feature selection¶
Here, we want to do some feature selection to help get a sense of the most important features. First, we start by preprocessing categorical features, ordering the ordinal ones and then one-hot encoding.
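The ordinal step relies on ordered pd.Categorical codes. A minimal sketch with a truncated age_order (illustrative only; the notebook's age_order covers all ten brackets):

```python
import pandas as pd

age_order = ['[0-10)', '[10-20)', '[20-30)']  # truncated for illustration
s = pd.Series(['[20-30)', '[0-10)', '[10-20)'])
codes = pd.Categorical(s, categories=age_order, ordered=True).codes
print(list(codes))  # [2, 0, 1]; unseen categories would map to -1
```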
A. Preprocessing for LASSO¶
# Preprocessing and dataset checks for LASSO feature selection
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Create a copy of df_clean to avoid modifying the original dataframe
df_encoded = df_clean.copy()
# Define the order for drug variables
drug_order = {'No': 0, 'Down': 1, 'Steady': 2, 'Up': 3}
# List of drug-related variables
drug_columns = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'tolazamide',
'pioglitazone', 'rosiglitazone', 'troglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'glipizide-metformin',
'metformin-rosiglitazone', 'metformin-pioglitazone', 'glimepiride-pioglitazone']
# Apply the ordinal mapping to drug variables
for col in drug_columns:
if col in df_encoded.columns:
df_encoded[col] = df_encoded[col].map(drug_order)
# Encode the age ranges as ordinal
df_encoded['age'] = pd.Categorical(df_encoded['age'], categories=age_order, ordered=True).codes
# Preprocess diag_1, diag_2, diag_3 combining all
for col in ['diag_1', 'diag_2', 'diag_3']:
df_encoded[col] = df_encoded[col].str.split('.').str[0] # Keep only the ICD-9 code before the decimal point
# Group less frequent categories into other
for col in ['diag_1', 'diag_2', 'diag_3']:
top_categories = df_encoded[col].value_counts().nlargest(20).index # Limit levels; one-hot encoding every raw code is prohibitively slow
df_encoded[col] = df_encoded[col].apply(lambda x: x if x in top_categories else 'Other')
# # Replace NaNs in 'medical_specialty' and 'payer_code' with 'unknown'
# categorical_na_columns = ['medical_specialty', 'payer_code']
# for col in categorical_na_columns:
# df_encoded[col] = df_encoded[col].fillna('unknown')
# Drop highly missing columns
columns_to_drop = ['patient_nbr', 'max_glu_serum', 'A1Cresult'] # weight is already dropped
df_encoded = df_encoded.drop(columns=columns_to_drop)
# Check for remaining missing values
# print("Missing values before dropping rows:")
# print(df_encoded.isnull().sum())
# Drop rows with NaN values
df_encoded = df_encoded.dropna()
# Or impute them with the mode (or use KNN imputer)
# for col in ['diag_1', 'diag_2', 'race']:
# df_encoded[col] = df_encoded[col].fillna(df_encoded[col].mode()[0])
# Ensure the dataframe still has rows
if df_encoded.shape[0] == 0:
raise ValueError("No rows left after dropping NaNs. Consider imputing missing values instead.")
# Check the shape of the final dataset
# print(f"Shape after preprocessing: {df_encoded.shape}")
# Define predictors (X) and response (y)
X = df_encoded.drop(columns=['readmitted']) # Exclude the target variable
y = df_encoded['readmitted'].cat.codes # Encode the target variable as 0, 1, 2
# Separate numeric and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
# Define the preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features), # Scale numeric features
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features) # One-hot encode categorical features
# drop='first' avoids multicollinearity issues, which matters for regression (not for random forests)
# handle_unknown='ignore' is important for sparse categories that might not show up in train and test splits
]
)
# Apply preprocessing to X
X_preprocessed = preprocessor.fit_transform(X)
# Check the shape of the preprocessed dataset
print(f"Preprocessed dataset shape: {X_preprocessed.shape}")
Preprocessed dataset shape: (101766, 262)
B. LASSO penalized regression¶
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import train_test_split
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.3, random_state=1234)
# Fit LASSO logistic regression (LogisticRegressionCV with L1 penalty)
lasso = LogisticRegressionCV(
penalty='l1', # LASSO penalty
solver='saga', # Supports L1 regularization
Cs=np.logspace(-3, 3, 20), # Lambda values (inverse of regularization strength)
cv=5, # 5-fold cross-validation
scoring='neg_log_loss', # Metric: log loss
random_state=1234,
max_iter=5000,
n_jobs=-1 #Use all CPU cores
)
lasso.fit(X_train, y_train)
# Extract the best lambda (C is the inverse of lambda)
best_lambda = 1 / lasso.C_[0]
print(f"Best lambda (penalty strength): {best_lambda}")
# Extract the feature names
feature_names = (
list(numeric_features) +
list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))
)
# Extract the coefficients
coefficients = pd.DataFrame({
'Feature': feature_names,
'Coefficient': lasso.coef_[0]
}).sort_values(by='Coefficient', ascending=False)
print("\nLASSO Coefficients:")
print(coefficients)
# Identify features not pushed to zero
selected_features = coefficients[coefficients['Coefficient'] != 0]
print("\nSelected Features (Non-zero Coefficients):")
print(selected_features)
# Evaluate the LASSO model
y_pred = lasso.predict(X_test)
y_pred_proba = lasso.predict_proba(X_test)
print("\nModel Evaluation:")
print(f"Log Loss: {log_loss(y_test, y_pred_proba):.4f}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Best lambda (penalty strength): 2.9763514416313193
LASSO Coefficients:
Feature Coefficient
46 discharge_disposition_id_28 1.702512
35 discharge_disposition_id_15 1.371556
41 discharge_disposition_id_22 1.208156
93 medical_specialty_Hematology 0.861029
25 discharge_disposition_id_5 0.823082
.. ... ...
214 metformin_3 -0.214204
107 medical_specialty_Orthopedics-Reconstructive -0.222754
159 diag_1_486 -0.223664
109 medical_specialty_Otolaryngology -0.254790
42 discharge_disposition_id_23 -0.276486
[262 rows x 2 columns]
Selected Features (Non-zero Coefficients):
Feature Coefficient
46 discharge_disposition_id_28 1.702512
35 discharge_disposition_id_15 1.371556
41 discharge_disposition_id_22 1.208156
93 medical_specialty_Hematology 0.861029
25 discharge_disposition_id_5 0.823082
.. ... ...
214 metformin_3 -0.214204
107 medical_specialty_Orthopedics-Reconstructive -0.222754
159 diag_1_486 -0.223664
109 medical_specialty_Otolaryngology -0.254790
42 discharge_disposition_id_23 -0.276486
[99 rows x 2 columns]
Model Evaluation:
Log Loss: 0.8730
Accuracy: 0.5861
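A caveat on the coefficient tables above: with three response classes, LogisticRegressionCV fits one coefficient row per class, so lasso.coef_[0] reflects only the first class. A small synthetic check of the coefficient shape:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# With three classes, coef_ has one row per class; coef_[0] covers only the first
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_.shape)  # (3, 5)
```

A feature zeroed out for one class may still carry a non-zero coefficient for another, so counting non-zeros per class can understate how many features the full model uses.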
import re
# Separate numerical features before processing categorical ones
numeric_features = coefficients['Feature'][coefficients['Feature'].isin(X.select_dtypes(include=['int64', 'float64']).columns)]
categorical_features = coefficients['Feature'][~coefficients['Feature'].isin(numeric_features)]
# Apply regex to remove one-hot encoding suffixes only for categorical features
original_features = categorical_features.str.replace(r'_[^_]+$', '', regex=True)
# Combine the cleaned categorical features with the numerical features
unique_features = pd.concat([numeric_features, original_features]).unique()
print("Unique Original Features:")
print(unique_features)
# List of features in the dataset (from preprocessing)
dataset_features = X.columns
# Cleaned unique features (from LASSO coefficients)
lasso_unique_features = pd.Series(unique_features)
# Find common features
common_features = lasso_unique_features[lasso_unique_features.isin(dataset_features)]
# Find features in the dataset but not selected by LASSO
unused_features = pd.Series(dataset_features[~dataset_features.isin(lasso_unique_features)])
# # Print results
# print("Common Features (Selected by LASSO and Present in Dataset):")
# print(common_features)
print("\nUnused Features (Present in Dataset but NOT Selected by LASSO):")
print(unused_features)
Unique Original Features:
['number_inpatient' 'num_medications' 'number_emergency' 'num_lab_procedures'
 'time_in_hospital' 'num_procedures' 'number_outpatient' 'number_diagnoses'
 'discharge_disposition_id' 'medical_specialty' 'diag_1' 'payer_code'
 'glipizide' 'diag_3' 'insulin' 'diag_2' 'metformin' 'admission_source_id'
 'race' 'nateglinide' 'change' 'glyburide' 'repaglinide'
 'metformin-rosiglitazone' 'glimepiride-pioglitazone' 'glipizide-metformin'
 'glyburide-metformin' 'acarbose' 'metformin-pioglitazone' 'chlorpropamide'
 'glimepiride' 'acetohexamide' 'miglitol' 'tolbutamide' 'pioglitazone'
 'rosiglitazone' 'tolazamide' 'troglitazone' 'gender' 'admission_type_id'
 'diabetesMed']

Unused Features (Present in Dataset but NOT Selected by LASSO):
0    age
dtype: object
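The suffix-stripping regex r'_[^_]+$' used to map one-hot column names back to their source features can be sanity-checked on a few encoded names (including a hyphenated drug combination, which it leaves intact):

```python
import re

pattern = r'_[^_]+$'  # strip the final _<level> suffix added by one-hot encoding
examples = ['discharge_disposition_id_28', 'medical_specialty_Hematology',
            'glyburide-metformin_1']
print([re.sub(pattern, '', s) for s in examples])
```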
Penalized regression with LASSO only removes age.
Next we will try Random Forest.
C. Random forest for variable importance¶
from sklearn.ensemble import RandomForestClassifier
# Define predictors (X) and response (y)
X = df_encoded.drop(columns=['readmitted']) # Exclude the target variable
y = df_encoded['readmitted'].cat.codes # Encode the target variable as 0, 1, 2
# Separate numeric and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
# Define the preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features), # Scale numeric features
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
]
)
# Apply preprocessing to X
X_preprocessed = preprocessor.fit_transform(X)
# Check the shape of the preprocessed dataset
print(f"Preprocessed dataset shape: {X_preprocessed.shape}")
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.3, random_state=1234)
# Train a Random Forest model
rf = RandomForestClassifier(random_state=1234)
rf.fit(X_train, y_train)
# Get feature importance
importances = rf.feature_importances_
# Combine feature names and importances
# Retrieve feature names from the preprocessor
numeric_features = list(numeric_features)
categorical_features = list(preprocessor.named_transformers_['cat'].get_feature_names_out())
all_feature_names = numeric_features + categorical_features
# Ensure the lengths match
if len(all_feature_names) != len(importances):
raise ValueError(f"Mismatch between feature names ({len(all_feature_names)}) and importances ({len(importances)}).")
# Create a DataFrame for feature importances
feature_importances = pd.DataFrame({
'Feature': all_feature_names,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
# Display feature importance
print("Feature Importance:")
print(feature_importances)
# Set a threshold for importance (e.g., remove features with very low importance)
threshold = 0.01 # Keep features with importance > 1%
selected_features = feature_importances[feature_importances['Importance'] > threshold]
print("\nSelected Features:")
print(selected_features)
# Filter X_train and X_test based on selected features
selected_indices = selected_features.index
X_train_selected = X_train[:, selected_indices]
X_test_selected = X_test[:, selected_indices]
Preprocessed dataset shape: (101766, 295)
Feature Importance:
Feature Importance
1 num_lab_procedures 6.914669e-02
3 num_medications 6.335111e-02
0 time_in_hospital 4.773139e-02
6 number_inpatient 3.891731e-02
7 number_diagnoses 3.463782e-02
.. ... ...
63 admission_source_id_14 6.224589e-08
134 medical_specialty_Psychiatry-Addictive 0.000000e+00
132 medical_specialty_Proctology 0.000000e+00
128 medical_specialty_Perinatology 0.000000e+00
290 metformin-pioglitazone_2 0.000000e+00
[295 rows x 2 columns]
Selected Features:
Feature Importance
1 num_lab_procedures 0.069147
3 num_medications 0.063351
0 time_in_hospital 0.047731
6 number_inpatient 0.038917
7 number_diagnoses 0.034638
2 num_procedures 0.032493
4 number_outpatient 0.016526
219 diag_3_Other 0.015195
14 gender_Female 0.014016
15 gender_Male 0.013818
5 number_emergency 0.013592
178 diag_1_Other 0.013543
157 medical_specialty_Unknown 0.013516
200 diag_2_Other 0.013181
75 payer_code_MC 0.012627
84 payer_code_Unknown 0.012415
10 race_Caucasian 0.011001
201 diag_3_250 0.010885
25 discharge_disposition_id_1 0.010603
import re
# Retrieve feature names from the preprocessor
# For categorical features, OneHotEncoder creates feature names based on categories
categorical_features_rf = preprocessor.named_transformers_['cat'].get_feature_names_out()
numeric_features_rf = list(numeric_features)
# Combine numeric and categorical feature names
all_feature_names_rf = numeric_features_rf + list(categorical_features_rf)
# Ensure the lengths match with the importance values
if len(all_feature_names_rf) != len(importances):
raise ValueError(f"Mismatch between feature names ({len(all_feature_names_rf)}) and importances ({len(importances)}).")
# Map selected feature indices back to their names
selected_features_rf = feature_importances[feature_importances['Importance'] > threshold]
selected_feature_indices = selected_features_rf.index
selected_feature_names = [all_feature_names_rf[i] for i in selected_feature_indices]
# Separate numerical and categorical selected features
selected_numeric_features = [feature for feature in selected_feature_names if feature in numeric_features_rf]
selected_categorical_features = [feature for feature in selected_feature_names if feature not in numeric_features_rf]
# Apply regex to clean one-hot encoded names for categorical features only
cleaned_selected_categorical_features = [re.sub(r'_[^_]+$', '', feature) for feature in selected_categorical_features]
# Combine cleaned categorical features with numeric features
unique_selected_features = pd.Series(selected_numeric_features + cleaned_selected_categorical_features).unique()
print("Unique Original Features from Random Forest:")
print(unique_selected_features)
# Compare with dataset features
dataset_features_rf = list(X.columns)
# Find common features
common_rf_features = [feature for feature in unique_selected_features if feature in dataset_features_rf]
# Find unused features in the dataset
unused_rf_features = [feature for feature in dataset_features_rf if feature not in unique_selected_features]
# Print results
print("\nCommon Features (Selected by Random Forest and Present in Dataset):")
print(common_rf_features)
print("\nUnused Features (Present in Dataset but NOT Selected by Random Forest):")
print(unused_rf_features)
Unique Original Features from Random Forest:
['num_lab_procedures' 'num_medications' 'time_in_hospital' 'number_inpatient'
 'number_diagnoses' 'num_procedures' 'number_outpatient' 'number_emergency'
 'diag_3' 'gender' 'diag_1' 'medical_specialty' 'diag_2' 'payer_code' 'race'
 'discharge_disposition_id']

Common Features (Selected by Random Forest and Present in Dataset):
['num_lab_procedures', 'num_medications', 'time_in_hospital', 'number_inpatient', 'number_diagnoses', 'num_procedures', 'number_outpatient', 'number_emergency', 'diag_3', 'gender', 'diag_1', 'medical_specialty', 'diag_2', 'payer_code', 'race', 'discharge_disposition_id']

Unused Features (Present in Dataset but NOT Selected by Random Forest):
['age', 'admission_type_id', 'admission_source_id', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed']
With an importance threshold of 1%, the random forest retains 16 original features. Age, the admission variables, and the medication variables are mostly excluded.
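Worth noting as a cross-check (not run on the full dataset here): impurity-based importances can favor high-cardinality or continuous features, so permutation importance on held-out data is a common complement. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Permutation importance: score drop when each feature is shuffled on held-out data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)
print(result.importances_mean.round(3))
```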
3.4.5 Important features¶
Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.¶
Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).
This replots several of the variables that seemed important based on the summary statistics, the plots above, and the random forest variable importance. To see the relationships a little differently, violin plots and stacked bar plots are used. Violin plots are appropriate for continuous variables: in addition to the median and IQR information also provided by box plots, they reveal where data points cluster throughout the distribution. Stacked bar plots are appropriate for categorical variables because they compare, side by side, the proportion of each response class within each level of the variable, helping show where a category may influence the response class.
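The idea behind the stacked bars is that each bar represents one categorical level and stacks to 1 across the response classes. A minimal sketch with made-up proportions (not the actual data):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Made-up within-level proportions: rows are levels of `change`, columns are classes
props = pd.DataFrame({'<30': [0.10, 0.12], '>30': [0.35, 0.33], 'NO': [0.55, 0.55]},
                     index=['Ch', 'No'])
ax = props.plot(kind='bar', stacked=True)  # each bar stacks to a total of 1
ax.set_ylabel('Proportion')
plt.close('all')
print([round(s, 6) for s in props.sum(axis=1)])
```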
from textwrap import wrap
df_plotting = df_clean.copy()
# Filter diag_1 and medical specialty to include only categories with more than 1000 occurrences
diag_1_top_values = df_plotting['diag_1'].value_counts()
diag_1_top_values = diag_1_top_values[diag_1_top_values > 1000].index
df_plotting['diag_1_limited'] = df_plotting['diag_1'].where(df_plotting['diag_1'].isin(diag_1_top_values), other='Other')
med_spec_top_values = df_plotting['medical_specialty'].value_counts()
med_spec_top_values = med_spec_top_values[med_spec_top_values > 500].index
df_plotting['medical_specialty_limited'] = df_plotting['medical_specialty'].where(df_plotting['medical_specialty'].isin(med_spec_top_values), other='Other')
# Define numeric and categorical variables
numVar = ['time_in_hospital', 'num_medications', 'number_emergency', 'number_inpatient']
catVar = ['payer_code', 'diag_1_limited', 'medical_specialty_limited', 'discharge_disposition_id']
# Violin Plots for Numeric Variables in Two Columns
ncols = 2 # Number of columns in the grid
nrows = -(-len(numVar) // ncols) # Calculate the number of rows needed (ceiling division)
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(14, 6 * nrows)) # Adjust size for clarity
axes = axes.flatten() # Flatten axes for easier indexing if nrows * ncols > len(numVar)
for i, var in enumerate(numVar):
ax = axes[i] # Get the appropriate subplot axis
sns.violinplot(
data=df_plotting,
x=var,
y='readmitted',
hue = 'readmitted',
order=readmit_order,
palette='muted',
ax=ax
)
ax.set_title(f'Violin Plot of {var} by Readmitted')
ax.set_xlabel(var)
ax.set_ylabel('Readmitted')
# Hide any unused subplots
for j in range(len(numVar), len(axes)):
axes[j].axis('off')
plt.tight_layout()
# plt.savefig(f'plots/grouped_violinplots.png')
plt.show()
# Stacked Barplots for Categorical Variables in Two Columns
ncols = 2 # Number of columns in the grid
nrows = -(-len(catVar) // ncols) # Calculate the number of rows needed (ceiling division)
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(14, 6 * nrows))
axes = axes.flatten() # Flatten axes for easier indexing if nrows * ncols > len(catVar)
for i, var in enumerate(catVar):
ax = axes[i] # Get the appropriate subplot axis
# Calculate proportions of response classes within each level of the categorical variable
grouped_counts = (
df_plotting.groupby(var, observed=True)['readmitted']
.value_counts(normalize=True)
.unstack(fill_value=0) * 100
)
# Plot stacked barplot
grouped_counts[readmit_order].plot(
kind='bar',
stacked=True,
ax=ax,
color=sns.color_palette("muted"),
edgecolor='black',
width=0.8
)
ax.set_title(f'Stacked Barplot of {var} by Readmitted')
ax.set_xlabel(var)
ax.set_ylabel('Percentage (%)')
ax.legend(title='Readmitted', loc='upper right')
ax.tick_params(axis='x', rotation=0)  # Reset rotation; the wrapped labels are rotated when set below
# Dynamically fetch and wrap the x-tick labels
labels = [ '\n'.join(wrap(label, 10)) for label in grouped_counts.index.astype(str)]
ax.set_xticks(range(len(labels))) # Ensure tick positions match
ax.set_xticklabels(labels, rotation=90) # Apply wrapped labels
# Hide any unused subplots
for j in range(len(catVar), len(axes)):
axes[j].axis('off')
plt.tight_layout()
# plt.savefig(f'plots/stacked_barplots.png')
plt.show()
Observations on the Plots
Violin Plots:
- time_in_hospital: The distribution for early readmitted patients is smoother with a slightly higher median compared to those not readmitted. All distributions are right-skewed.
- num_medications: There are subtle differences in the range and medians.
- Late readmissions have a slightly narrower range than the other two classes.
- The 'NO' class has the smallest median.
- Late readmitted has the largest median.
- number_emergency: Differences appear in the tails of the distributions:
- The 'NO' class has the smallest range.
- Late readmitted has the largest range.
- Early readmitted falls in between. This could suggest that late readmitted patients may experience more unexpected health challenges, often treated in emergency settings. Most observations fall between 0 and a few visits, but early readmitted patients have more observations in the "few visits" range compared to zero visits.
- number_inpatient: The range increases across classes:
- The 'NO' class has the smallest range.
- Early readmitted has the largest range. For the 'NO' class, most observations are at zero, whereas for early readmitted patients, the distribution becomes denser in the 0-5 range.
Stacked Bar Plots:
- payer_code:
- All observations from the 'FR' category are not readmitted, possibly indicating very few observations in this category.
- 'MP' and 'DM' show the smallest proportions of not readmitted cases.
- diag_1: Among the 24 most frequent primary diagnosis codes plus an other category:
- Codes with the lowest proportions of not readmitted patients include 250.6, 428, 491, and 493.
- Code 715 has the highest proportion of not readmitted patients.
- Codes 786 and 435 have the fewest early readmitted patients.
- medical_specialty: Among the 16 most frequent medical specialties plus other and unknown categories:
- Nephrology is the specialty with the lowest proportion of not readmitted and highest proportion of early readmitted patients.
- Specialties with the smallest proportion of early readmitted and largest proportion of not readmitted patients are Obstetrics/Gynecology and Surgery - Cardiovascular/Thoracic.
- discharge_disposition_id:
- Categories 11, 19, and 20 show no readmitted observations, likely due to having very few total observations.
- Category 12 has the highest percentage of early readmitted patients and one of the lowest percentages of not readmitted patients, with no late readmitted observations.
- Category 15 shows the lowest proportion of not readmitted and among the highest of early readmitted.
# Scatter plot matrix of the numeric variables
# Take a stratified 10% sample (grouped by readmitted class) of the ~100K records
# so the pairplot stays legible
df_sampled = pd.concat(
[group.sample(frac=0.1) for _, group in df_clean.groupby('readmitted', observed=False)],
axis=0
).reset_index(drop=True)
# Ensure the 'readmitted' column is present explicitly
df_sampled = df_sampled.loc[:, df_clean.columns]
# Add jitter to numeric variables
df_sampled_jittered = df_sampled.copy()
numeric_cols = df_sampled.select_dtypes(include=['int64']).columns
df_sampled_jittered[numeric_cols] = df_sampled_jittered[numeric_cols].values + \
np.random.rand(len(df_sampled_jittered), len(numeric_cols)) / 2
# Create the pairplot
sns.pairplot(
data=df_sampled_jittered,
hue="readmitted", # Group by 'readmitted'
diag_kws={"common_norm": False}, # Default ensures densities are group-specific
);
# plt.savefig(f'plots/matrixScatter.png')
Observations from the scatterplot:
Pairwise relationships: There doesn't appear to be a strong linear trend between any pair of variables, suggesting that the numeric variables aren't strongly correlated and multicollinearity isn't much of a concern.
Distributions: The kernel density plots show that many of the variables (e.g. time_in_hospital, number_outpatient, and number_emergency) appear to have skewed distributions, with a majority of values concentrated at the lower end of the range. num_lab_procedures has a more symmetric distribution compared to other variables, with a relatively broader range of values.
Outliers: Some variables (e.g. number_outpatient, number_emergency, and number_diagnoses) have extreme outlier values. These should be explored further.
Relationship with the target variable: The hospital visit variables (e.g. number_outpatient, number_emergency, and number_inpatient) show more high-valued data points in the readmitted '<30' and '>30' categories, which could indicate that frequent hospital visits correlate with the likelihood of readmission. num_medications appears less strongly related to the readmitted categories than the visit-related metrics, given the spread of the not-readmitted data points throughout its range.
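One simple screen for the extreme values noted above is Tukey's IQR rule (flag values beyond Q3 + 1.5·IQR). This sketch uses a small made-up series as a stand-in for a skewed count column such as number_emergency.

```python
import pandas as pd

# Toy data standing in for a right-skewed count variable like number_emergency
s = pd.Series([0, 0, 0, 0, 1, 0, 2, 0, 0, 25])

# Tukey's rule: values beyond Q3 + 1.5*IQR are flagged as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
n_outliers = int((s > upper).sum())
print(f'{n_outliers} outlier(s) above {upper}')
```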
B. Correlation matrix
# Select only numeric columns for the correlation matrix
numeric_df = df_clean.select_dtypes(include=['int64', 'float64'])
# Calculate the correlation matrix
corr_matrix = numeric_df.corr()
# Create the plot
f, ax = plt.subplots(figsize=(8, 8)) # Figure and axes
sns.heatmap(corr_matrix,
annot=True, # Plot numeric annotations
fmt=".2f", # Format for annotations
square=True, # Keep cells square
cmap=sns.color_palette("coolwarm", as_cmap=True),
cbar_kws={"shrink": 0.7}, # Colorbar customization
ax=ax) # Use the same axis
# Adjust layout
f.tight_layout()
# plt.savefig(f'plots/correlationplt.png')
plt.show()
From the correlation plot, the numerical variables don't show a lot of evidence of correlation or multicollinearity. num_medications is somewhat correlated with time_in_hospital (0.47) and num_procedures (0.39).
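To surface the strongest pairs programmatically rather than reading them off the heatmap, the upper triangle of the correlation matrix can be ranked by absolute value. This sketch uses a small synthetic frame in place of df_clean.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'b' is built to correlate with 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({'a': a,
                   'b': a * 0.7 + rng.normal(size=200),
                   'c': rng.normal(size=200)})

corr = df.corr()
# Keep the upper triangle only so each pair appears once, then rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().abs().sort_values(ascending=False)
print(pairs.head())
```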
C. Cross tabulations
It would be interesting to get a sense of the discharges by payer. Are some payers more likely to pay for rehabilitation or other services after discharge?
# Filter to include only categories with more than 2000 occurrences
discharge_top_values = df_plotting['discharge_disposition_id'].value_counts()
discharge_top_values = discharge_top_values[discharge_top_values > 2000].index
# Ensure 'Other' is included in the categories
if 'Other' not in df_plotting['discharge_disposition_id'].cat.categories:
df_plotting['discharge_disposition_id'] = df_plotting['discharge_disposition_id'].cat.add_categories(['Other'])
# Apply the where condition to limit categories
df_plotting['discharge_id_limited'] = df_plotting['discharge_disposition_id'].where(
df_plotting['discharge_disposition_id'].isin(discharge_top_values),
other='Other'
)
# Create cross-tab
discharge_crosstab = pd.crosstab(df_plotting['payer_code'], df_plotting['discharge_id_limited'])
print(discharge_crosstab)
# Normalize the cross-tab to calculate rates
discharge_rate = discharge_crosstab.div(discharge_crosstab.sum(1).astype(float), axis=0)
# Plot the cross-tab and normalized rates
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Plot cross-tab
discharge_crosstab.plot(kind='bar', stacked=True, colormap='Spectral_r', ax=axes[0])
axes[0].set_title('Cross-tab: Payer Code vs. Discharge ID')
axes[0].set_ylabel('Counts')
axes[0].set_xlabel('Payer Code')
# Plot normalized rates
discharge_rate.plot(kind='bar', stacked=True, colormap='Spectral_r', ax=axes[1])
axes[1].set_title('Normalized Rates: Payer Code vs. Discharge ID')
axes[1].set_ylabel('Proportions')
axes[1].set_xlabel('Payer Code')
plt.tight_layout()
# plt.savefig(f'plots/crosstab_payer_discharge.png')
plt.show()
discharge_id_limited      1    2     3     6    18  Other
payer_code
BC                     3711  147   189   421     0    187
CH                      119    1     4    15     0      7
CM                     1129   41   351   282     1    133
CP                     1908   65   147   264     3    146
DM                      397    8    27    82     2     33
FR                        1    0     0     0     0      0
HM                     4158  256   533  1047     2    278
MC                    16024  688  7659  5052     9   3007
MD                     2362   42   510   318     0    300
MP                       59    3     4     9     0      4
OG                      840   33    22    55     0     83
OT                       83    2     2     3     0      5
PO                      445    5    31    73     0     38
SI                       42    2     3     6     0      2
SP                     3548  117   545   459     0    338
UN                     1865   63   181   174     6    159
Unknown               23449  655  3728  4628  3668   4128
WC                       94    0    18    14     0      9
This displays the most frequent discharge codes for each payer. The discharge code descriptions follow:
- 1: Discharged to home
- 2: Discharged/transferred to another short-term hospital
- 3: Discharged/transferred to SNF
- 6: Discharged/transferred to home with home health service
- 18: NULL (I think this is distinct from 'died'; the documentation has separate categories for 'expired'.)
'MC' patients have the smallest proportion discharged to home and, along with 'CM', the largest proportion discharged to a SNF (skilled nursing facility). There is quite a bit of variability in discharge disposition across payers.
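For labeling plots, the numeric discharge codes can be mapped to the short descriptions listed above (abbreviated from the dataset's IDs_mapping file), with unmapped codes falling back to 'Other'. A minimal sketch:

```python
import pandas as pd

# Short labels for the most common discharge_disposition_id codes
# (descriptions abbreviated from the dataset's IDs_mapping file)
discharge_labels = {
    1: 'Home',
    2: 'Short-term hospital',
    3: 'SNF',
    6: 'Home w/ home health',
    18: 'NULL',
}

ids = pd.Series([1, 1, 3, 6, 18, 99])            # toy id values
labeled = ids.map(discharge_labels).fillna('Other')  # unknown ids become 'Other'
print(labeled.tolist())
```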
Additionally, are some payers more likely to have patients that get emergency treatment?
# Define bin edges and labels
bins = [0, 1, 4, 6, 11, float('inf')]  # Left-closed edges: [0,1), [1,4), [4,6), [6,11), [11,inf)
labels = ['0', '1-3', '4-5', '6-10', '11+']  # Labels matching the left-closed bins above
# Bin the 'number_emergency' variable
df_clean['number_emergency_binned'] = pd.cut(df_clean['number_emergency'], bins=bins, labels=labels, right=False)
# Create cross-tab
emergency_crosstab = pd.crosstab(df_clean['payer_code'], df_clean['number_emergency_binned'])
print(emergency_crosstab)
# Normalize the cross-tab to calculate rates
emergency_rate = emergency_crosstab.div(emergency_crosstab.sum(1).astype(float), axis=0)
# Plot the cross-tab and normalized rates
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Plot cross-tab
emergency_crosstab.plot(kind='bar', stacked=True, colormap='Spectral_r', ax=axes[0])
axes[0].set_title('Cross-tab: Payer Code vs. Number of Emergencies')
axes[0].set_ylabel('Counts')
axes[0].set_xlabel('Payer Code')
# Plot normalized rates
emergency_rate.plot(kind='bar', stacked=True, colormap='Spectral_r', ax=axes[1])
axes[1].set_title('Normalized Rates: Payer Code vs. Number of Emergencies')
axes[1].set_ylabel('Proportions')
axes[1].set_xlabel('Payer Code')
plt.tight_layout()
# plt.savefig(f'plots/crosstab_payer_emergencies.png')
plt.show()
number_emergency_binned      0   1-3  4-5  6-10  11+
payer_code
BC                        4035   522   47    39   12
CH                         128    18    0     0    0
CM                        1650   245   31     5    6
CP                        2337   182   11     2    1
DM                         366   128   20    19   16
FR                           1     0    0     0    0
HM                        5093   957  165    57    2
MC                       28133  3762  390   130   24
MD                        2788   523  112    72   37
MP                          58    19    2     0    0
OG                         790   142   61    31    9
OT                          78    14    2     1    0
PO                         541    48    3     0    0
SI                          48     7    0     0    0
SP                        4444   493   44    21    5
UN                        2175   225   29    15    4
Unknown                  37590  2427  182    50    7
WC                         128     7    0     0    0
From earlier, we saw that the 'DM' payer had one of the smallest proportions of the not readmitted class. Interestingly, the cross tab here shows that 'DM' is also associated with the highest numbers of emergency visits and the smallest proportion of patients with zero emergency visits.
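To go beyond eyeballing the normalized rates, a chi-square test of independence could quantify whether emergency-visit counts vary by payer. This is a hedged sketch with made-up counts (the real analysis would pass emergency_crosstab.values); the Pearson statistic is computed directly with NumPy so no extra dependencies are assumed.

```python
import numpy as np

# Toy contingency table (rows: two hypothetical payers, columns: visit bins)
table = np.array([
    [366, 128, 55],
    [2337, 182, 14],
], dtype=float)

# Expected counts under independence: outer product of margins / grand total
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row @ col / table.sum()

# Pearson chi-square statistic and degrees of freedom
chi2 = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)
print(f'chi2={chi2:.1f}, dof={dof}')
```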
Are there other features that could be added to the data or created from existing features? Which ones?¶
Some options for combining features for simplicity are to group drugs by class and to group diagnosis codes by health category (e.g. cardiovascular, endocrine, etc.). Variables with many categorical levels could be reorganized to group sparse levels into an 'Other' category (e.g. payer_code, medical_specialty, and discharge_disposition_id). Another option to consider would be restricting the analysis to adults over 20 years old: the dataset primarily consists of adults, and it seems logical that the disease, treatments, and outcomes of children might differ from those of adults.
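As a sketch of the diagnosis-grouping idea, ICD-9 codes can be bucketed into broad health categories. The ranges below cover only a few illustrative chapters and are not a complete ICD-9 mapping; V/E codes and missing values fall through to 'Other'.

```python
import pandas as pd

def diag_category(code):
    """Bucket an ICD-9 code string into a broad category (illustrative ranges only)."""
    try:
        c = float(code)
    except (TypeError, ValueError):
        return 'Other'  # V/E codes and missing values are not numeric
    if 390 <= c <= 459 or c == 785:
        return 'Circulatory'
    if 250 <= c < 251:
        return 'Diabetes'
    if 460 <= c <= 519 or c == 786:
        return 'Respiratory'
    return 'Other'

codes = pd.Series(['428', '250.6', '486', 'V57', '715'])
cats = codes.map(diag_category).tolist()
print(cats)
```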
A. One hot encoding and PCA: First Approach
# Drop columns that may not contribute to analysis
# pca_df = pd.read_csv('data/diabetes+130-us+hospitals+for+years+1999-2008/diabetic_data.csv')
pca_df = pca_df.drop(['encounter_id', 'patient_nbr','weight','max_glu_serum','A1Cresult', 'examide', 'citoglipton'], axis=1)
pca_df
| race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | ... | tolazamide | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [0-10) | 6 | 25 | 1 | 1 | ? | Pediatrics-Endocrinology | 41 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | Caucasian | Female | [10-20) | 1 | 1 | 7 | 3 | ? | ? | 59 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | AfricanAmerican | Female | [20-30) | 1 | 1 | 7 | 2 | ? | ? | 11 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | Caucasian | Male | [30-40) | 1 | 1 | 7 | 2 | ? | ? | 44 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | Caucasian | Male | [40-50) | 1 | 1 | 7 | 1 | ? | ? | 51 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | AfricanAmerican | Male | [70-80) | 1 | 3 | 7 | 3 | MC | ? | 51 | ... | No | Down | No | No | No | No | No | Ch | Yes | >30 |
| 101762 | AfricanAmerican | Female | [80-90) | 1 | 4 | 5 | 5 | MC | ? | 33 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |
| 101763 | Caucasian | Male | [70-80) | 1 | 1 | 7 | 1 | MC | ? | 53 | ... | No | Down | No | No | No | No | No | Ch | Yes | NO |
| 101764 | Caucasian | Female | [80-90) | 2 | 3 | 7 | 10 | MC | Surgery-General | 45 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 101765 | Caucasian | Male | [70-80) | 1 | 1 | 7 | 6 | ? | ? | 13 | ... | No | No | No | No | No | No | No | No | No | NO |
101766 rows × 43 columns
# Identify high-cardinality columns (more than 9 unique values)
high_cardinality_columns = [col for col in pca_df.columns if pca_df[col].nunique() > 9]
# Step 1: Create frequency maps for the high-cardinality columns
frequency_maps = {col: pca_df[col].value_counts() for col in high_cardinality_columns}
# Step 2: Apply frequency encoding, with a fallback of 0 for unseen categories
for col in high_cardinality_columns:
    pca_df[f'{col}_encoded'] = pca_df[col].map(frequency_maps[col]).fillna(0)
print(pca_df)
race gender age admission_type_id \
0 Caucasian Female [0-10) 6
1 Caucasian Female [10-20) 1
2 AfricanAmerican Female [20-30) 1
3 Caucasian Male [30-40) 1
4 Caucasian Male [40-50) 1
... ... ... ... ...
101761 AfricanAmerican Male [70-80) 1
101762 AfricanAmerican Female [80-90) 1
101763 Caucasian Male [70-80) 1
101764 Caucasian Female [80-90) 2
101765 Caucasian Male [70-80) 1
discharge_disposition_id admission_source_id time_in_hospital \
0 25 1 1
1 1 7 3
2 1 7 2
3 1 7 2
4 1 7 1
... ... ... ...
101761 3 7 3
101762 4 5 5
101763 1 7 1
101764 3 7 10
101765 1 7 6
payer_code medical_specialty num_lab_procedures ... \
0 ? Pediatrics-Endocrinology 41 ...
1 ? ? 59 ...
2 ? ? 11 ...
3 ? ? 44 ...
4 ? ? 51 ...
... ... ... ... ...
101761 MC ? 51 ...
101762 MC ? 33 ...
101763 MC ? 53 ...
101764 MC Surgery-General 45 ...
101765 ? ? 13 ...
medical_specialty_encoded num_lab_procedures_encoded \
0 159 2117
1 49949 1624
2 49949 689
3 49949 2496
4 49949 1925
... ... ...
101761 49949 1925
101762 49949 1297
101763 49949 1802
101764 3099 2376
101765 49949 404
num_medications_encoded number_outpatient_encoded \
0 262 85027
1 4523 85027
2 6086 3594
3 5430 85027
4 4353 85027
... ... ...
101761 5430 85027
101762 4523 85027
101763 4913 8547
101764 3230 85027
101765 900 85027
number_emergency_encoded number_inpatient_encoded diag_1_encoded \
0 90383 67630 95
1 90383 67630 1889
2 90383 19521 285
3 90383 67630 515
4 90383 67630 320
... ... ... ...
101761 90383 67630 851
101762 90383 19521 876
101763 90383 67630 1688
101764 90383 19521 1967
101765 90383 67630 531
diag_2_encoded diag_3_encoded number_diagnoses_encoded
0 358 1423 219
1 1523 69 49474
2 6071 37 10161
3 37 2357 10393
4 85 11555 11393
... ... ... ...
101761 74 285 49474
101762 6752 358 49474
101763 134 214 16
101764 1520 316 49474
101765 290 358 49474
[101766 rows x 58 columns]
# Dropping old high cardinality columns
pca_df.drop(high_cardinality_columns, axis=1, inplace=True)
# Identify all new columns that have 'encoded' in their names
high_cardinality_columns = [col for col in pca_df.columns if 'encoded' in col or 'readmitted' in col]
categorical_columns = pca_df.select_dtypes(include='object').columns
# Keep the first 24 object columns (race, gender, the drug columns, and change);
# 'diabetesMed' and 'readmitted' are handled separately below
categorical_columns = categorical_columns[0:24]
print(categorical_columns)
Index(['race', 'gender', 'metformin', 'repaglinide', 'nateglinide',
'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide',
'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
'miglitol', 'troglitazone', 'tolazamide', 'insulin',
'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change'],
dtype='object')
#one-hot encoding to all categorical columns
df_encoded = pd.get_dummies(pca_df.drop(columns=high_cardinality_columns + ['readmitted']),
columns=categorical_columns,
drop_first=False)
df_encoded = pd.concat([pca_df[high_cardinality_columns] , pca_df[['readmitted']], df_encoded], axis=1)
# Updated dataset
print(df_encoded.info())
print(df_encoded.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 100 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 readmitted 101766 non-null object 1 age_encoded 101766 non-null int64 2 discharge_disposition_id_encoded 101766 non-null int64 3 admission_source_id_encoded 101766 non-null int64 4 time_in_hospital_encoded 101766 non-null int64 5 payer_code_encoded 101766 non-null int64 6 medical_specialty_encoded 101766 non-null int64 7 num_lab_procedures_encoded 101766 non-null int64 8 num_medications_encoded 101766 non-null int64 9 number_outpatient_encoded 101766 non-null int64 10 number_emergency_encoded 101766 non-null int64 11 number_inpatient_encoded 101766 non-null int64 12 diag_1_encoded 101766 non-null int64 13 diag_2_encoded 101766 non-null int64 14 diag_3_encoded 101766 non-null int64 15 number_diagnoses_encoded 101766 non-null int64 16 readmitted 101766 non-null object 17 admission_type_id 101766 non-null int64 18 num_procedures 101766 non-null int64 19 diabetesMed 101766 non-null object 20 race_? 
101766 non-null bool 21 race_AfricanAmerican 101766 non-null bool 22 race_Asian 101766 non-null bool 23 race_Caucasian 101766 non-null bool 24 race_Hispanic 101766 non-null bool 25 race_Other 101766 non-null bool 26 gender_Female 101766 non-null bool 27 gender_Male 101766 non-null bool 28 gender_Unknown/Invalid 101766 non-null bool 29 metformin_Down 101766 non-null bool 30 metformin_No 101766 non-null bool 31 metformin_Steady 101766 non-null bool 32 metformin_Up 101766 non-null bool 33 repaglinide_Down 101766 non-null bool 34 repaglinide_No 101766 non-null bool 35 repaglinide_Steady 101766 non-null bool 36 repaglinide_Up 101766 non-null bool 37 nateglinide_Down 101766 non-null bool 38 nateglinide_No 101766 non-null bool 39 nateglinide_Steady 101766 non-null bool 40 nateglinide_Up 101766 non-null bool 41 chlorpropamide_Down 101766 non-null bool 42 chlorpropamide_No 101766 non-null bool 43 chlorpropamide_Steady 101766 non-null bool 44 chlorpropamide_Up 101766 non-null bool 45 glimepiride_Down 101766 non-null bool 46 glimepiride_No 101766 non-null bool 47 glimepiride_Steady 101766 non-null bool 48 glimepiride_Up 101766 non-null bool 49 acetohexamide_No 101766 non-null bool 50 acetohexamide_Steady 101766 non-null bool 51 glipizide_Down 101766 non-null bool 52 glipizide_No 101766 non-null bool 53 glipizide_Steady 101766 non-null bool 54 glipizide_Up 101766 non-null bool 55 glyburide_Down 101766 non-null bool 56 glyburide_No 101766 non-null bool 57 glyburide_Steady 101766 non-null bool 58 glyburide_Up 101766 non-null bool 59 tolbutamide_No 101766 non-null bool 60 tolbutamide_Steady 101766 non-null bool 61 pioglitazone_Down 101766 non-null bool 62 pioglitazone_No 101766 non-null bool 63 pioglitazone_Steady 101766 non-null bool 64 pioglitazone_Up 101766 non-null bool 65 rosiglitazone_Down 101766 non-null bool 66 rosiglitazone_No 101766 non-null bool 67 rosiglitazone_Steady 101766 non-null bool 68 rosiglitazone_Up 101766 non-null bool 69 acarbose_Down 101766 non-null bool 
70 acarbose_No 101766 non-null bool 71 acarbose_Steady 101766 non-null bool 72 acarbose_Up 101766 non-null bool 73 miglitol_Down 101766 non-null bool 74 miglitol_No 101766 non-null bool 75 miglitol_Steady 101766 non-null bool 76 miglitol_Up 101766 non-null bool 77 troglitazone_No 101766 non-null bool 78 troglitazone_Steady 101766 non-null bool 79 tolazamide_No 101766 non-null bool 80 tolazamide_Steady 101766 non-null bool 81 tolazamide_Up 101766 non-null bool 82 insulin_Down 101766 non-null bool 83 insulin_No 101766 non-null bool 84 insulin_Steady 101766 non-null bool 85 insulin_Up 101766 non-null bool 86 glyburide-metformin_Down 101766 non-null bool 87 glyburide-metformin_No 101766 non-null bool 88 glyburide-metformin_Steady 101766 non-null bool 89 glyburide-metformin_Up 101766 non-null bool 90 glipizide-metformin_No 101766 non-null bool 91 glipizide-metformin_Steady 101766 non-null bool 92 glimepiride-pioglitazone_No 101766 non-null bool 93 glimepiride-pioglitazone_Steady 101766 non-null bool 94 metformin-rosiglitazone_No 101766 non-null bool 95 metformin-rosiglitazone_Steady 101766 non-null bool 96 metformin-pioglitazone_No 101766 non-null bool 97 metformin-pioglitazone_Steady 101766 non-null bool 98 change_Ch 101766 non-null bool 99 change_No 101766 non-null bool dtypes: bool(80), int64(17), object(3) memory usage: 23.3+ MB None readmitted age_encoded discharge_disposition_id_encoded \ 0 NO 161 989 1 >30 691 60234 2 NO 1657 60234 3 NO 3775 60234 4 NO 9685 60234 admission_source_id_encoded time_in_hospital_encoded payer_code_encoded \ 0 29565 14208 40256 1 57494 17756 40256 2 57494 17224 40256 3 57494 17224 40256 4 57494 14208 40256 medical_specialty_encoded num_lab_procedures_encoded \ 0 159 2117 1 49949 1624 2 49949 689 3 49949 2496 4 49949 1925 num_medications_encoded number_outpatient_encoded ... \ 0 262 85027 ... 1 4523 85027 ... 2 6086 3594 ... 3 5430 85027 ... 4 4353 85027 ... 
glipizide-metformin_No glipizide-metformin_Steady \ 0 True False 1 True False 2 True False 3 True False 4 True False glimepiride-pioglitazone_No glimepiride-pioglitazone_Steady \ 0 True False 1 True False 2 True False 3 True False 4 True False metformin-rosiglitazone_No metformin-rosiglitazone_Steady \ 0 True False 1 True False 2 True False 3 True False 4 True False metformin-pioglitazone_No metformin-pioglitazone_Steady change_Ch \ 0 True False False 1 True False True 2 True False False 3 True False True 4 True False True change_No 0 True 1 False 2 True 3 False 4 False [5 rows x 100 columns]
## Remove duplicate columns
df_encoded = df_encoded.loc[:, ~df_encoded.T.duplicated()]
# Check for columns with missing values
missing_summary = df_encoded.isnull().sum()
print(missing_summary[missing_summary > 0]) # Columns with missing data
Series([], dtype: int64)
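A note on the duplicate-column removal above: `df_encoded.T.duplicated()` compares full column contents, which requires transposing ~100K rows. Since the duplicate here comes from the repeated 'readmitted' column name, deduplicating by name is a cheaper alternative. A minimal sketch on a toy frame:

```python
import pandas as pd

# Concatenation produces two columns named 'readmitted'
df = pd.concat(
    [pd.DataFrame({'readmitted': [0, 1], 'x': [5, 6]}),
     pd.DataFrame({'readmitted': [0, 1]})],
    axis=1,
)

# Keep the first occurrence of each column name (no transpose needed)
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))
```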
# Factorize all categorical variables
# Select all boolean columns
categorical_columns = list(df_encoded.select_dtypes(include='bool').columns)
# Append 'readmitted' column to the list (if it exists in the DataFrame)
if 'readmitted' in df_encoded.columns:
    categorical_columns.append('readmitted')
# Factorize; note that pd.factorize assigns codes by order of appearance,
# so True may map to 0 in one column and 1 in another
for col in categorical_columns:
    df_encoded[col], mapping = pd.factorize(df_encoded[col])
print(df_encoded.info())
print(df_encoded.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 99 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 readmitted 101766 non-null int64 1 age_encoded 101766 non-null int64 2 discharge_disposition_id_encoded 101766 non-null int64 3 admission_source_id_encoded 101766 non-null int64 4 time_in_hospital_encoded 101766 non-null int64 5 payer_code_encoded 101766 non-null int64 6 medical_specialty_encoded 101766 non-null int64 7 num_lab_procedures_encoded 101766 non-null int64 8 num_medications_encoded 101766 non-null int64 9 number_outpatient_encoded 101766 non-null int64 10 number_emergency_encoded 101766 non-null int64 11 number_inpatient_encoded 101766 non-null int64 12 diag_1_encoded 101766 non-null int64 13 diag_2_encoded 101766 non-null int64 14 diag_3_encoded 101766 non-null int64 15 number_diagnoses_encoded 101766 non-null int64 16 admission_type_id 101766 non-null int64 17 num_procedures 101766 non-null int64 18 diabetesMed 101766 non-null object 19 race_? 
101766 non-null int64 20 race_AfricanAmerican 101766 non-null int64 21 race_Asian 101766 non-null int64 22 race_Caucasian 101766 non-null int64 23 race_Hispanic 101766 non-null int64 24 race_Other 101766 non-null int64 25 gender_Female 101766 non-null int64 26 gender_Male 101766 non-null int64 27 gender_Unknown/Invalid 101766 non-null int64 28 metformin_Down 101766 non-null int64 29 metformin_No 101766 non-null int64 30 metformin_Steady 101766 non-null int64 31 metformin_Up 101766 non-null int64 32 repaglinide_Down 101766 non-null int64 33 repaglinide_No 101766 non-null int64 34 repaglinide_Steady 101766 non-null int64 35 repaglinide_Up 101766 non-null int64 36 nateglinide_Down 101766 non-null int64 37 nateglinide_No 101766 non-null int64 38 nateglinide_Steady 101766 non-null int64 39 nateglinide_Up 101766 non-null int64 40 chlorpropamide_Down 101766 non-null int64 41 chlorpropamide_No 101766 non-null int64 42 chlorpropamide_Steady 101766 non-null int64 43 chlorpropamide_Up 101766 non-null int64 44 glimepiride_Down 101766 non-null int64 45 glimepiride_No 101766 non-null int64 46 glimepiride_Steady 101766 non-null int64 47 glimepiride_Up 101766 non-null int64 48 acetohexamide_No 101766 non-null int64 49 acetohexamide_Steady 101766 non-null int64 50 glipizide_Down 101766 non-null int64 51 glipizide_No 101766 non-null int64 52 glipizide_Steady 101766 non-null int64 53 glipizide_Up 101766 non-null int64 54 glyburide_Down 101766 non-null int64 55 glyburide_No 101766 non-null int64 56 glyburide_Steady 101766 non-null int64 57 glyburide_Up 101766 non-null int64 58 tolbutamide_No 101766 non-null int64 59 tolbutamide_Steady 101766 non-null int64 60 pioglitazone_Down 101766 non-null int64 61 pioglitazone_No 101766 non-null int64 62 pioglitazone_Steady 101766 non-null int64 63 pioglitazone_Up 101766 non-null int64 64 rosiglitazone_Down 101766 non-null int64 65 rosiglitazone_No 101766 non-null int64 66 rosiglitazone_Steady 101766 non-null int64 67 rosiglitazone_Up 101766 
non-null int64 68 acarbose_Down 101766 non-null int64 69 acarbose_No 101766 non-null int64 70 acarbose_Steady 101766 non-null int64 71 acarbose_Up 101766 non-null int64 72 miglitol_Down 101766 non-null int64 73 miglitol_No 101766 non-null int64 74 miglitol_Steady 101766 non-null int64 75 miglitol_Up 101766 non-null int64 76 troglitazone_No 101766 non-null int64 77 troglitazone_Steady 101766 non-null int64 78 tolazamide_No 101766 non-null int64 79 tolazamide_Steady 101766 non-null int64 80 tolazamide_Up 101766 non-null int64 81 insulin_Down 101766 non-null int64 82 insulin_No 101766 non-null int64 83 insulin_Steady 101766 non-null int64 84 insulin_Up 101766 non-null int64 85 glyburide-metformin_Down 101766 non-null int64 86 glyburide-metformin_No 101766 non-null int64 87 glyburide-metformin_Steady 101766 non-null int64 88 glyburide-metformin_Up 101766 non-null int64 89 glipizide-metformin_No 101766 non-null int64 90 glipizide-metformin_Steady 101766 non-null int64 91 glimepiride-pioglitazone_No 101766 non-null int64 92 glimepiride-pioglitazone_Steady 101766 non-null int64 93 metformin-rosiglitazone_No 101766 non-null int64 94 metformin-rosiglitazone_Steady 101766 non-null int64 95 metformin-pioglitazone_No 101766 non-null int64 96 metformin-pioglitazone_Steady 101766 non-null int64 97 change_Ch 101766 non-null int64 98 change_No 101766 non-null int64 dtypes: int64(98), object(1) memory usage: 76.9+ MB None readmitted age_encoded discharge_disposition_id_encoded \ 0 0 161 989 1 1 691 60234 2 0 1657 60234 3 0 3775 60234 4 0 9685 60234 admission_source_id_encoded time_in_hospital_encoded payer_code_encoded \ 0 29565 14208 40256 1 57494 17756 40256 2 57494 17224 40256 3 57494 17224 40256 4 57494 14208 40256 medical_specialty_encoded num_lab_procedures_encoded \ 0 159 2117 1 49949 1624 2 49949 689 3 49949 2496 4 49949 1925 num_medications_encoded number_outpatient_encoded ... \ 0 262 85027 ... 1 4523 85027 ... 2 6086 3594 ... 3 5430 85027 ... 4 4353 85027 ... 
glipizide-metformin_No glipizide-metformin_Steady \ 0 0 0 1 0 0 2 0 0 3 0 0 4 0 0 glimepiride-pioglitazone_No glimepiride-pioglitazone_Steady \ 0 0 0 1 0 0 2 0 0 3 0 0 4 0 0 metformin-rosiglitazone_No metformin-rosiglitazone_Steady \ 0 0 0 1 0 0 2 0 0 3 0 0 4 0 0 metformin-pioglitazone_No metformin-pioglitazone_Steady change_Ch \ 0 0 0 0 1 0 0 1 2 0 0 0 3 0 0 1 4 0 0 1 change_No 0 0 1 1 2 0 3 1 4 1 [5 rows x 99 columns]
columns_and_dtypes_df = df_encoded.dtypes.reset_index()
columns_and_dtypes_df.columns = ['Column', 'Data Type']
print(columns_and_dtypes_df)
# Factorize the remaining categorical columns not yet encoded
left_out = ['readmitted', 'diabetesMed']
for col in left_out:
    df_encoded[col], mapping = pd.factorize(df_encoded[col])
                              Column Data Type
0                         readmitted     int64
1                        age_encoded     int64
2   discharge_disposition_id_encoded     int64
3        admission_source_id_encoded     int64
4           time_in_hospital_encoded     int64
..                               ...       ...
94    metformin-rosiglitazone_Steady     int64
95         metformin-pioglitazone_No     int64
96     metformin-pioglitazone_Steady     int64
97                         change_Ch     int64
98                         change_No     int64

[99 rows x 2 columns]
from sklearn.ensemble import RandomForestClassifier
# Quick feature importance analysis
X = df_encoded.drop(['readmitted'], axis=1)
y = df_encoded['readmitted']
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
# Feature importance
feature_importance = pd.Series(model.feature_importances_, index=X.columns)
feature_importance.sort_values(ascending=False, inplace=True)
for feature,importance in feature_importance.items():
if importance >= .02:
print(f"{feature}: {importance}")
# Save feature names with importance >= 0.02 into a list
important_features = feature_importance[feature_importance >= 0.02].index.tolist()
# Now, `important_features` contains the feature names
print(important_features)
num_lab_procedures_encoded: 0.0818232748807238
diag_1_encoded: 0.077213240832635
diag_2_encoded: 0.07709493956968286
diag_3_encoded: 0.0750309175655698
num_medications_encoded: 0.07414044621294007
time_in_hospital_encoded: 0.055917804731340046
age_encoded: 0.04668385312935144
number_inpatient_encoded: 0.042892540678905106
medical_specialty_encoded: 0.04165801624714001
discharge_disposition_id_encoded: 0.04010798391462907
payer_code_encoded: 0.039672204111507306
num_procedures: 0.03690513874689049
number_diagnoses_encoded: 0.035261827866249026
admission_type_id: 0.027663855584695227
admission_source_id_encoded: 0.024541875672926948
['num_lab_procedures_encoded', 'diag_1_encoded', 'diag_2_encoded', 'diag_3_encoded', 'num_medications_encoded', 'time_in_hospital_encoded', 'age_encoded', 'number_inpatient_encoded', 'medical_specialty_encoded', 'discharge_disposition_id_encoded', 'payer_code_encoded', 'num_procedures', 'number_diagnoses_encoded', 'admission_type_id', 'admission_source_id_encoded']
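An alternative to the fixed 0.02 importance cutoff used above is to keep the smallest set of features whose importances sum to a target share of the total. A minimal sketch on synthetic data (the column names and the 0.80 target are illustrative, not from the lab's dataset):

```python
# Select features by cumulative importance instead of a fixed per-feature cutoff.
# Synthetic data; feature names are illustrative placeholders.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y_arr = make_classification(n_samples=500, n_features=10,
                                   n_informative=4, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(10)])

model = RandomForestClassifier(random_state=42).fit(X, y_arr)
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

# Keep the top features until their importances account for 80% of the total
cum = importance.cumsum()
selected = cum[cum <= 0.80].index.tolist()
if not selected:  # guard: always keep at least the single most important feature
    selected = [importance.index[0]]
print(selected)
```

This trades a hand-tuned per-feature threshold for an interpretable "share of total importance" knob, at the cost of a variable-length feature list.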
############################# PCA for Feature Reduction ##############################
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Step 1: Preprocessing - Standardize the Data
# Restrict to the high-importance features identified above
df_reduced = df_encoded[important_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_reduced) # Standardize the features
# Step 2: Fit PCA
pca = PCA(n_components=13)  # 13 components retain ~93% of the variance
X_pca = pca.fit_transform(X_scaled)
# Step 3: Inspect Results
explained_variance_ratio = pca.explained_variance_ratio_ # Variance explained by each component
n_components = pca.n_components_ # Number of components chosen
print(f"Number of components selected: {n_components}")
print(f"Explained variance ratio: {explained_variance_ratio}")
print(f"Total explained variance: {sum(explained_variance_ratio):.2f}")
# Step 4: Create a DataFrame for PCA-transformed data
X_pca_df = pd.DataFrame(X_pca, columns=[f"PC{i+1}" for i in range(n_components)])
# Save for further analysis
# X_pca_df.to_csv("data/pca_transformed_data.csv", index=False)
Number of components selected: 13
Explained variance ratio: [0.12686543 0.1182879  0.08422112 0.08243677 0.06919534 0.06802914
 0.06483953 0.06103524 0.05854137 0.05545738 0.05029781 0.04588131
 0.04420946]
Total explained variance: 0.93
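Rather than hand-picking 13 components and checking the resulting variance, scikit-learn's `PCA` also accepts a float `n_components`, which selects the smallest number of components whose cumulative explained variance meets that fraction. A sketch on synthetic data (the 0.90 target is an illustrative choice):

```python
# Let PCA choose the component count from a target variance fraction.
# Synthetic data for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 15))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_auto = PCA(n_components=0.90)  # keep enough components for >= 90% of variance
X_auto = pca_auto.fit_transform(X_demo_scaled)
print(pca_auto.n_components_, X_auto.shape)
```

After fitting, `pca_auto.n_components_` reports how many components were kept, so the choice is reproducible without manual iteration.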
# Visualize explained variance
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance_ratio)+1), explained_variance_ratio, marker='o', linestyle='--')
plt.title("Explained Variance by Principal Components")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
# plt.savefig(f'plots/PCAexplainedVariance_factorized.png')
plt.show()
from sklearn.manifold import TSNE
# Note: t-SNE on ~100k rows is computationally expensive; expect a long runtime
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.1)
plt.title("t-SNE Visualization")
plt.colorbar(label="Readmitted")
# plt.savefig(f'plots/tSNE_factorized.png')
plt.show()
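Because t-SNE scales poorly with sample size, a common workaround is to fit it on a stratified subsample rather than the full ~100k rows. A minimal sketch on synthetic data (the 10% sample size and perplexity value are illustrative choices, mirroring the stratified sampling used for the PCA scatter plots below):

```python
# Fit t-SNE on a stratified 10% subsample to keep runtime manageable.
# Synthetic data stands in for the scaled feature matrix.
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X_full, y_full = make_classification(n_samples=2000, n_features=13,
                                     random_state=42)
# Keep 10% of rows, preserving the class balance
X_sub, _, y_sub, _ = train_test_split(X_full, y_full, train_size=0.10,
                                      stratify=y_full, random_state=42)

emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_sub)
print(emb.shape)  # (200, 2)
```

Perplexity must stay below the subsample size, so shrink it (e.g. to 5–15) if the sample gets very small.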
from sklearn.model_selection import train_test_split
# Plot PC1 vs PC2 by response class
pc1 = X_pca[:, 0] # First principal component
pc2 = X_pca[:, 1] # Second principal component
Read_Reduced = df_encoded['readmitted']
pca_df = pd.DataFrame({
    'PC1': pc1,
    'PC2': pc2,
    'Readmitted': Read_Reduced.values  # .values avoids index misalignment with the PC arrays
})
# Perform stratified sampling (10%)
pca_sample, _ = train_test_split(
pca_df,
test_size=0.9, # Keep 10% of the data
stratify=pca_df['Readmitted'], # Stratify by the 'Readmitted' column
random_state=1234
)
# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Scatter plot with the stratified sample
sns.scatterplot(
data=pca_sample,
x='PC1',
y='PC2',
hue='Readmitted',
palette='muted',
alpha=0.7,
ax=ax1
)
ax1.set_title('Scatter Plot (10% Stratified Sample)')
ax1.set_xlabel('Principal Component 1 (PC1)')
ax1.set_ylabel('Principal Component 2 (PC2)')
ax1.legend(title='Readmitted')
# Get axis limits from scatterplot
x_limits = ax1.get_xlim()
y_limits = ax1.get_ylim()
# KDE plot with full data, matching axis limits
sns.kdeplot(
data=pca_df,
x='PC1',
y='PC2',
hue='Readmitted',
palette='muted',
ax=ax2,
legend=True
)
ax2.set_title('KDE Plot (Matched Axis Limits)')
ax2.set_xlabel('Principal Component 1 (PC1)')
ax2.set_ylabel('')
# Apply matching axis limits
ax2.set_xlim(x_limits)
ax2.set_ylim(y_limits)
# Add a figure-wide title
fig.suptitle('PC1 vs. PC2 by Readmitted', fontsize=14)
# Adjust layout and save the figure
plt.tight_layout(rect=[0, 0, 1, 0.95]) # Adjust rect to prevent overlap with suptitle
# plt.savefig('plots/PC1vPC2_by_readmitted_factorized.png')
plt.show()
B. One-Hot Encoding and PCA: Alternative Approach
# Remove the patient identifier (high-missing-value columns left commented out for reference)
columns_to_drop = ['patient_nbr'] #, 'weight', 'max_glu_serum', 'A1Cresult']
df_encoded = df_clean.drop(columns= columns_to_drop)
# Preprocess diag_1, diag_2, diag_3 combining all
for col in ['diag_1', 'diag_2', 'diag_3']:
    df_encoded[col] = df_encoded[col].str.split('.').str[0]  # keep only the ICD-9 category code before the decimal
# print(df_encoded[['diag_1', 'diag_2', 'diag_3']].head(20))
# # Replace NaNs in 'medical_specialty' and 'payer_code' with 'unknown'
# categorical_na_columns = ['medical_specialty', 'payer_code']
# for col in categorical_na_columns:
# df_encoded[col] = df_encoded[col].fillna('unknown')
# Drop rows with NaN values (race and diagnoses cols)
df_encoded = df_encoded.dropna()
# Remove target (split into X and y, for plotting with y later)
X_df_encoded = df_encoded.drop(columns='readmitted')
y_df_encoded = df_encoded['readmitted']
# Handle categorical variables with one-hot encoding
categorical_columns = X_df_encoded.select_dtypes(include=['object', 'category']).columns
X_df_encoded = pd.get_dummies(X_df_encoded, columns=categorical_columns, drop_first=False)
# Standardize the dataset
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_df_encoded)
# Apply PCA
from sklearn.decomposition import PCA
pca = PCA() # Keep all components initially
X_pca = pca.fit_transform(X_scaled)
# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
# Plot cumulative variance
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by PCA Components')
plt.grid()
# plt.savefig(f'plots/PCAexplainedVariance.png')
plt.show()
# Determine the number of components to explain ~80% of variance
n_components_80 = np.argmax(cumulative_variance >= 0.80) + 1 # +1 because indices start at 0
print(f"Number of original features: {X_df_encoded.shape[1]}\nNumber of components to explain ~80% of variance: {n_components_80}")
Number of original features: 2428
Number of components to explain ~80% of variance: 1692
from sklearn.model_selection import train_test_split
# Plot PC1 vs PC2 by response class
pc1 = X_pca[:, 0] # First principal component
pc2 = X_pca[:, 1] # Second principal component
pca_df = pd.DataFrame({
    'PC1': pc1,
    'PC2': pc2,
    'Readmitted': y_df_encoded.values  # .values avoids index misalignment after dropna
})
# Perform stratified sampling (10%)
pca_sample, _ = train_test_split(
pca_df,
test_size=0.9, # Keep 10% of the data
stratify=pca_df['Readmitted'], # Stratify by the 'Readmitted' column
random_state=1234
)
# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Scatter plot with the stratified sample
sns.scatterplot(
data=pca_sample,
x='PC1',
y='PC2',
hue='Readmitted',
palette='muted',
alpha=0.7,
ax=ax1
)
ax1.set_title('Scatter Plot (10% Stratified Sample)')
ax1.set_xlabel('Principal Component 1 (PC1)')
ax1.set_ylabel('Principal Component 2 (PC2)')
ax1.legend(title='Readmitted')
# Get axis limits from scatterplot
x_limits = ax1.get_xlim()
y_limits = ax1.get_ylim()
# KDE plot with full data, matching axis limits
sns.kdeplot(
data=pca_df,
x='PC1',
y='PC2',
hue='Readmitted',
palette='muted',
ax=ax2,
legend=True
)
ax2.set_title('KDE Plot')
ax2.set_xlabel('Principal Component 1 (PC1)')
ax2.set_ylabel('')
# Apply matching axis limits
ax2.set_xlim(x_limits)
ax2.set_ylim(y_limits)
# Add a figure-wide title
fig.suptitle('PC1 vs. PC2 by Readmitted', fontsize=14)
# Adjust layout and save the figure
plt.tight_layout(rect=[0, 0, 1, 0.95]) # Adjust rect to prevent overlap with suptitle
# plt.savefig('plots/PC1vPC2_by_readmitted.png')
plt.show()
These plots display only the first two principal components (PCs) out of the 1,692 required to explain 80% of the variance. While these two PCs do not separate the response classes, they reveal distinct patterns in the data distribution:
- Non-readmitted patients spread widely, extending to the exterior boundaries of the plot.
- Early readmitted patients cluster more densely near the center, suggesting smaller variance in their principal component scores.
- Late readmitted patients have a broader distribution than early readmitted patients but less spread than the non-readmitted group.
Although principal components are less interpretable than the original features, PCA successfully reduced the dataset from the original 2,428 one-hot-encoded features. This dimensionality reduction may be useful for further analyses and modeling.
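For downstream modeling, the scaling and PCA steps can be chained into a scikit-learn `Pipeline` so the same reduction is applied consistently to training and test data. A sketch on synthetic data (the logistic-regression estimator and the 0.80 variance target are illustrative choices, not the lab's final model):

```python
# Wire scaling + PCA + a classifier into one Pipeline.
# Synthetic data stands in for the one-hot-encoded feature matrix.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.80)),     # keep >= 80% of variance
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```

Fitting the scaler and PCA inside the pipeline also prevents test-set leakage, since both are fit on the training split only.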